Break the core functionality of set{fs,res}{u,g}id into cred_setX
which performs the access checks based on current_cred(), but performs
the requested change on a passed-in cred.
Export the helpers, since p9auth can be compiled as a module. It
might be worth not allowing modular p9auth to avoid having to export
them.
Really the setfs{u,g}id helper isn't needed, but move it as
well to keep the code consistent.
This patch also changes set_user() to use new->user->user_ns. While
technically not needed as all callers should have new->user->user_ns
equal to current_userns(), it is more correct and may prevent surprises
in the future.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
include/linux/cred.h | 12 +++++
kernel/cred.c | 119 +++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 131 ++++++++------------------------------------------
3 files changed, 151 insertions(+), 111 deletions(-)
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 52507c3..0ce7a50 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -22,6 +22,9 @@ struct user_struct;
struct cred;
struct inode;
+/* defined in sys.c, used in cred_setresuid */
+extern int set_user(struct cred *new);
+
/*
* COW Supplementary groups list
*/
@@ -396,4 +399,13 @@ do { \
*(_fsgid) = __cred->fsgid; \
} while(0)
+#define CRED_SETID_NOFORCE 0
+#define CRED_SETID_FORCE 1
+int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid,
+ int force);
+int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid, gid_t sgid,
+ int force);
+int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid);
+int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid);
+
#endif /* _LINUX_CRED_H */
diff --git a/kernel/cred.c b/kernel/cred.c
index e1dbe9e..4fc3284 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -785,6 +785,125 @@ int set_create_files_as(struct cred *new, struct inode *inode)
}
...Granting userid capabilities to another task is a dangerous
privilege. Don't just let file permissions authorize it.
Define CAP_GRANT_ID as a new capability needed to write to
/dev/caphash.
For one thing this lets us start a factotum server early on
in init, then have init drop CAP_GRANT_ID from its bounding
set so the rest of the system cannot regain it.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
include/linux/capability.h | 6 +++++-
security/selinux/include/classmap.h | 2 +-
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 39e5ff5..ba2cbfe 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -355,7 +355,11 @@ struct cpu_vfs_cap_data {
#define CAP_MAC_ADMIN 33
-#define CAP_LAST_CAP CAP_MAC_ADMIN
+/* Allow granting setuid capabilities through p9auth /dev/caphash */
+
+#define CAP_GRANT_ID 34
+
+#define CAP_LAST_CAP CAP_GRANT_ID
#define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 8b32e95..f0ec53a 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -142,7 +142,7 @@ struct security_class_mapping secclass_map[] = {
"node_bind", "name_connect", NULL } },
{ "memprotect", { "mmap_zero", NULL } },
{ "peer", { "recv", NULL } },
- { "capability2", { "mac_override", "mac_admin", NULL } },
+ { "capability2", { "mac_override", "mac_admin", "grant_id", NULL } },
{ "kernel_service", { "use_as_override", "create_files_as", NULL } },
{ "tun_socket",
{ COMMON_SOCK_PERMS, NULL } },
--
1.7.0.4
--
Isn't there some man page that also needs to be updated when you add something like this to the user/kernel api? thanks, greg k-h --
This is a driver that adds Plan 9 style capability device implementation. See Documentation/p9auth.txt for a description of how to use this. This driver allows the implementation of completely unprivileged login daemons. However, doing so requires a fundamental change regarding linux userids: a server privileged with the new CAP_GRANT_ID capability can create a one-time setuid capability allowing another process to change to one specific new userid. This is a change which must be discussed. The use of this privilege can be completely prevented by having init remove CAP_GRANT_ID from its capability bounding set before forking any processes. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> --- Documentation/p9auth.txt | 47 ++++ drivers/char/Kconfig | 2 + drivers/char/Makefile | 2 + drivers/char/p9auth/Kconfig | 9 + drivers/char/p9auth/Makefile | 1 + drivers/char/p9auth/p9auth.c | 517 ++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 578 insertions(+), 0 deletions(-) create mode 100644 Documentation/p9auth.txt create mode 100644 drivers/char/p9auth/Kconfig create mode 100644 drivers/char/p9auth/Makefile create mode 100644 drivers/char/p9auth/p9auth.c diff --git a/Documentation/p9auth.txt b/Documentation/p9auth.txt new file mode 100644 index 0000000..14a69d8 --- /dev/null +++ b/Documentation/p9auth.txt @@ -0,0 +1,47 @@ +The p9auth device driver implements a plan-9 factotum-like +capability API. Tasks which are privileged (authorized by +possession of the CAP_GRANT_ID privilege (POSIX capability)) +can write new capabilities to /dev/caphash. The kernel then +stores these until a task uses them by writing to the +/dev/capuse device. Each capability represents the ability +for a task running as userid X to switch to userid Y and +some set of groups. Each capability may be used only once, +and unused capabilities are cleared after two minutes. + +The following examples shows how to use the API. Shell ...
Hm, you didn't originally write this driver, so it would be good to get some original authorship information in here to keep everything correct, Is this code really ready for drivers/char/? What has changed in it that makes it ok to move out of the staging tree? That is incorrect, you don't need the cap_major/minor files at all, the device node should be automatically created for you, right? And do you really want to do all of this control through a device node? One might hope that today would be that day... Also, please run this through sparse. This is a variable that doesn't Do you really need to do it this way? A cdev for a single device node? Just always use a dynamic misc device, you never need a whole major for We have a built-in function for this already. Oh, this function is also incorrect, which is a good reason to use the built-in ones. I'm going to stop now, this doesn't belong in drivers/char/ yet, it needs work... thanks, Hm, wait, one more.... The kernel/user api is going to change some time in the future? Please fix this now before it gets merged. --
That's why I left the MODULE_AUTHOR line in there - not sure what else to do for that. I'll add a comment in p9auth.txt, especially It was dropped from staging :) I don't particularly care to see it go back into staging, as opposed to working out issues out of tree (assuming they are solvable). For one thing, as you note below, there is the question of whether it should be a device driver at Well... At first I was thinking same as you were. So I was going to switch to a pure syscall-based approach. But it just turned out more complicated. The factotum server would call sys_grantid(), and the target task would end up doing some huge sys_setresugid() or else multiple syscalls using the granted id. It just was uglier. I think there's an experimental patchset sitting somewhere I could point to (if I weren't embarassed :). Another possibility would be to use netlink, but that doesn't appear as amenable to segragation by user namespaces. The pid (presumably/hopefully global pid, as __u32) is available, so it shouldn't be impossible, but a simple device with simple synchronous read/write certainly has its appeal. Firing off a message hoping Well there's nothing actually wrong with the cap_mutex. I put in this comment thinking I'd switch to rcu at some point, but it doesn't Right - will switch that over, assuming we don't nix the whole idea I haven't asked Ashwin, and will be happy either way. Ashwin, did Really I prefer to get rid of that altogether. Anyone who wants Huh, no that change already happened. thanks, -serge --
Well, since you have been driving a lot of implementation changes recently, I feel it makes sense for you to be a maintainer. Honestly I don't think I will be able to find time maintaining this in the future although I can do the code reviews for you. I will be more than happy if this feature continues to get traction. - Ashwin --
Ok. I'll put myself in MAINTAINERS. If we stick with the p9auth driver or filesystem approach, I'd like to keep you in MODULE_AUTHOR() if that's ok, bc it really is yours, and wouldn't exist without you. thanks, -serge --
pid in the netlink context is the netlink port-id. It is a very different concept from struct pid. These days netlink calls to the kernel are synchronous, not that I would encourage netlink for anything except networking code. Can we make this a trivial filesystem? I expect that would match up better with whatever plan9 userspace apps already exist, remove the inode double translation, and would make it much more reasonable to do a user namespace aware version if and when that becomes necessary. Eric --
Great idea, and so obvious once you mention it. I think that's the way to go. Thanks, Eric. -serge --
An fs actually seems overkill for two write-only files for process-related information. Would these actually be candidates for new /proc files? /proc/grantcred - replaces /dev/caphash, for privileged tasks to tell the kernel about new setuid capabilities /proc/self/usecred - replaces /dev/capuse for unprivileged tasks to make use of a setuid capability -serge --
An fs is fine. To relate this to Plan 9, where it all began, might be useful. There's no equivalent in Plan 9 to Linux/Unix devices of the major/minor number etc. variety. In-kernel drivers and out-of-kernel servers both end up providing the services (i.e. file name spaces) that we see in a Linux file system. So the Plan 9 driver for the capability device really does match closely in function and interface to a Linux kernel-based file system. Hence, making devcap a file system is entirely appropriate, because it best fits the way it works in Plan 9: a kernel driver that provides two files. It's pretty easy to write a Linux VFS anyway, so it makes sense from that point of view. Eric, that was a great suggestion. ron --
A fs provides user space policy control of naming. I.e. where the two files go. That can also be a very big deal. Especially when files are writable. You have no idea how much I am frustrated by sysfs right now, because it does not provide userspace policy control and instead mandates a sometimes inappropriate naming convention. Eric --
Well I'm not convinced that it's a worthwhile tradeoff for polluting /proc/filesystems and needing yet another fs mounted in each container, but a preliminary working version using an fs is at http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/sergeh/linux-cr.git;a=shortlog;h=r... I'll do some cleanup before sending it out. Eric, I'd said that the device-based version was namespace-aware, but that meant that you could on grant and use capabilities in your own user namespace. I suppose now that it's an fs we can do better semantics, where each user ns can mount its own p9auth, and anyone with CAP_GRANT_ID targeted at some user ns (i.e. root in a user_ns or the creator of a user_ns) can grant ids to that user ns. Though I'm not sure that's a feature anyone would ever use, and I do like the simplicity of just having one sb. -serge --
Which is a minor back compat issue - but you could start without it and allow init to add it. It seems a very complex interface to do a simple thing. A long time ago there was discussion around extending the AF_UNIX fd passing to permit 'pass handle and auth' so you could send someone a handle with a "become me" token attached. Alan --
Hi Alan, sorry I thought I had cc:d you, bc I was pretty sure you'd have some neat ideas. Like this one. One could try to argue that this makes every linux process susceptible to a trojan making it grant its userid to other tasks, but of course that's silly since the trojan could just fork. Well, what this would buy the attacker is the ability to sit inconspicuously under his old userid, holding on to the fd until the admin goes out to coffee before switching userids. The other thing is that offhand I think the server can't easily tell from the socket which user namespace the client is in, as ucred only has .uid. Though (1) we might need to create a 'struct puser' analogous to 'struct pid' for signals anyway, (2) userspace can segragate with fs or net_ns (if abstract sock), and (3) client in a container presumably won't be able to authenticate itself to server on the host anyway. Ashwin (and Ron), I think this idea will give us the same tools that the p9auth driver does, perhaps in a more unix-y way. Would you have objections, or do you see shortcomings? Thanks, Alan. -serge --
Possibly, could you put a timestamp in the passed object ? (Given the That I think needs fixing anyway and now is a good time as any to do it. If you are going to be able to pass userids then you need to be sure you don't get passed a handle with a uid change on it that you didn't want so adding another object type and option of some kind to accept them has to occur anyway. That also btw needs fixing for other reasons - more than one daemon has been written that generically uses recvmsg and so can be attacked with FD leaks >-) Alan --
Yup, in fact I do that for the p9auth device driver anyway, to avoid
Ignoring namespaces for a moment, I guess we could do something like
struct credentials_pass {
pid_t global_pid;
unsigned long unique_id;
uid_t new_uid;
gid_t new_gid;
int num_aux_gids;
gid_t aux_gids[];
}
The unique_id is assigned by the kernel; the client reads the whole
struct credentials_pass - except global_pid - from the unix sock,
verifies the desired credentials, then does sys_acceptid(unique_id)
to accept.
Again, how to pass user namespace information is not obvious to me.
And it's probably debatable whether we want to also pass info about
posix capabilities and LSM security data.
Still, my main concern with this approach is that it makes the server
more complicated. Since I'm passing along my current credentials, that
means I have to construct just the right credentials, which if using
sock identified by pathname may mean I can no longer write to the
socket, right?
And of course one other important consideration is that the overall
p9auth API has been heavily discussed and documented, whereas
implementing our own entirely new approach is breaking new ground
Yup.
(By 'needs fixing' you just mean needs to be done right for this
service? Else I think I'm missing something...)
thanks,
-serge
--
This looks surprising like what I am doing in passing uids and pids through unix domain sockets. So if this looks like a direction we want to go it shouldn't be too Remember my unix domain socket and the patch for converting struct cred into a new context, from a month or so ago. I think that is what we are talking about. Eric --
Zoinks! After some digging I found it in my containers.mbox and at https://lists.linux-foundation.org/pipermail/containers/2010-March/023405.html and see you even called me out. Sorry! I see your tree at http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/ebiederm/linux-2.6.33-nsfd-v5.git;... and commit "af_unix: Allow SO_PEERCRED to work across namespaces", and it all looks good. Definately useful for a SO_PASSCRED or somesuch implementation. thanks, -serge --
Hmm... for an alternative idea: We have this nice "kernel keyring" infrastructure that lets us stuff arbitrary things into "keys" and grant/revoke them between processes. What if we created a relatively generic way for processes to package up privileges (of whatever form) into a "key" that could be granted to another process (via UNIX-domain socket)? Then the other process would use a setuid()-ish syscall which would instead apply a specific key as your credentials, possibly including the audit context and/or namespaces it came from. By using the keyring system, such tokens could be kept around across multiple processes easily (as opposed to FDs), in the same style as a "sudo" ticket file, for example (even with an expiration time). Types of credentials you could pass around: * Capabilities * Filesystem UID/GID in a particular UID namespace (for FS operations) * Process UID/GID in a particular UID namespace (for kill(), etc) * Audit contexts * SELinux/etc security labels All of the above could be optionally limited to effectively require a bprm-secure-style exec() with specific args. So for example, instead of making "/usr/sbin/passwd" a setuid program, you could make it be an unprivileged helper. It would connect to a privileged daemon and ask for a password-change cookie for that particular user. The daemon would create what is essentially a "delayed exec" key which grants a specific UID and capabilities when that process performs an execkey(). So as an example, you could rewrite "sudo" as a partially-privileged daemon and an unprivileged helper. The unpriv helper would send across a request (optionally including the command and environment) which would be checked by the daemon. It would then issue a key to allow the unpriv helper to perform a limited exec. Another option would be to rewrite network login programs (eg OpenSSH) to use this for privilege separation. The listening process would get a non-expiring key to allow it to exec a ...
Hi Kyle, it's a neat idea in terms of flexibility - but in this case the flexibility really scares me :) In particular I don't like the ability to keep these tokens around for later use, and your example of passing around capabilities is something we've specifically rejected in the past. That doesn't mean that it wouldn't be a good idea to take your general approach, use it very inflexibly, and then slowly, as we gain experience, add flexibility. I think we are best off starting with two options: 1. pass audit credentials only 2. pass full credentials credentials meaning uid/gid, though - and let capabilities be Note that the last two examples are possible with with the simple p9auth driver, and are really their motivation. In fact, we only need a single privileged back-end (factotum) server to server a bunch of unprivileged front-end servers. So over the next few days I will incorporate a bunch of the feedback on the p9auth driver itself, turn it into a filesystem, and resend it. That's not to say that I don't like the other suggestions - I want to pursue those as well and we can see where we end up and which approach we prefer. For a first step I look forward to Eric pushing the patches to do uid and pid translation for unix sock credentials. thanks, -serge --
If you do go down this path there is a separate (and actually completely opposite) but related problem I might be able and willing to work with you on. When looking at how auditing works in this modern day and age of dbus+polkit to get background processes to do work on behalf of a user we were discussing an interface that would pass the information about the user to the background server process. The background server process could do some magic such that it still had all the permissions and rights of itself, but had the audit information of the original user. Thus even though it was a server process with uid=0 that did the work, the audit logs could know it was actually on behalf of uid=500. It was discussed passing that token of audit information over an AF_UNIX socket. -Eric --
This actually brings up an issue I've been a bit worried about: is credentials passing for dbus adequate? I thought that the last time I looked through some code, there was no way in particular for upstart to pass posix capabilities info along. What that means is that as root I can do capsh --drop=(list of all capabilities) -- reboot and, although I don't have cap_sys_boot, I can reboot the system. So the only way I can prevent a container from rebooting the host is to start it in a fresh network namespace to segrate the abstract unix domain sockets. But if I don't want a fresh network namespace, I'm out --
Can you mark these extern please? Other than that, these functions should probably be mentioned in Documentation/credentials.txt. David --
If you make this: if (ret == 0) return commit_creds(new); abort_creds(new); return ret; then gcc can tail-call commit_creds(), which is guaranteed to return 0. David --
