Re: [PATCH 1/3] p9auth: split core function out of some set*{u,g}id functions

Previous thread: Question on 2.6.25 kernel crash. by Viswanathan Murugesan on Tuesday, April 20, 2010 - 6:21 pm. (2 messages)

Next thread: [Patch 1/1] init: Provide a kernel start parameter to increase pid_max by Mike Travis on Tuesday, April 20, 2010 - 6:40 pm. (32 messages)
From: Serge E. Hallyn
Date: Tuesday, April 20, 2010 - 6:27 pm

Break the core functionality of set{fs,res}{u,g}id into cred_setX
which performs the access checks based on current_cred(), but performs
the requested change on a passed-in cred.

Export the helpers, since p9auth can be compiled as a module.  It
might be worth not allowing modular p9auth to avoid having to export
them.

Really the setfs{u,g}id helper isn't needed, but move it as
well to keep the code consistent.

This patch also changes set_user() to use new->user->user_ns.  While
technically not needed as all callers should have new->user->user_ns
equal to current_userns(), it is more correct and may prevent surprises
in the future.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 include/linux/cred.h |   12 +++++
 kernel/cred.c        |  119 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c         |  131 ++++++++------------------------------------------
 3 files changed, 151 insertions(+), 111 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 52507c3..0ce7a50 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -22,6 +22,9 @@ struct user_struct;
 struct cred;
 struct inode;
 
+/* defined in sys.c, used in cred_setresuid */
+extern int set_user(struct cred *new);
+
 /*
  * COW Supplementary groups list
  */
@@ -396,4 +399,13 @@ do {						\
 	*(_fsgid) = __cred->fsgid;		\
 } while(0)
 
+#define CRED_SETID_NOFORCE 0
+#define CRED_SETID_FORCE 1
+int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid,
+		   int force);
+int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid, gid_t sgid,
+		   int force);
+int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid);
+int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid);
+
 #endif /* _LINUX_CRED_H */
diff --git a/kernel/cred.c b/kernel/cred.c
index e1dbe9e..4fc3284 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -785,6 +785,125 @@ int set_create_files_as(struct cred *new, struct inode *inode)
 }
 ...
From: Serge E. Hallyn
Date: Tuesday, April 20, 2010 - 6:28 pm

Granting userid capabilities to another task is a dangerous
privilege.  Don't just let file permissions authorize it.
Define CAP_GRANT_ID as a new capability needed to write to
/dev/caphash.

For one thing this lets us start a factotum server early on
in init, then have init drop CAP_GRANT_ID from its bounding
set so the rest of the system cannot regain it.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 include/linux/capability.h          |    6 +++++-
 security/selinux/include/classmap.h |    2 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 39e5ff5..ba2cbfe 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -355,7 +355,11 @@ struct cpu_vfs_cap_data {
 
 #define CAP_MAC_ADMIN        33
 
-#define CAP_LAST_CAP         CAP_MAC_ADMIN
+/* Allow granting setuid capabilities through p9auth /dev/caphash */
+
+#define CAP_GRANT_ID         34
+
+#define CAP_LAST_CAP         CAP_GRANT_ID
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 8b32e95..f0ec53a 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -142,7 +142,7 @@ struct security_class_mapping secclass_map[] = {
 	    "node_bind", "name_connect", NULL } },
 	{ "memprotect", { "mmap_zero", NULL } },
 	{ "peer", { "recv", NULL } },
-	{ "capability2", { "mac_override", "mac_admin", NULL } },
+	{ "capability2", { "mac_override", "mac_admin", "grant_id", NULL } },
 	{ "kernel_service", { "use_as_override", "create_files_as", NULL } },
 	{ "tun_socket",
 	  { COMMON_SOCK_PERMS, NULL } },
-- 
1.7.0.4

--

From: Greg KH
Date: Tuesday, April 20, 2010 - 7:54 pm

Isn't there some man page that also needs to be updated when you add
something like this to the user/kernel api?

thanks,

greg k-h
--

From: Serge E. Hallyn
Date: Tuesday, April 20, 2010 - 6:29 pm

This is a driver that adds Plan 9 style capability device
implementation.  See Documentation/p9auth.txt for a description
of how to use this.

This driver allows the implementation of completely unprivileged
login daemons.  However, doing so requires a fundamental change
regarding linux userids: a server privileged with the new
CAP_GRANT_ID capability can create a one-time setuid capability
allowing another process to change to one specific new userid.
This is a change which must be discussed.  The use of this
privilege can be completely prevented by having init remove
CAP_GRANT_ID from its capability bounding set before forking any
processes.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 Documentation/p9auth.txt     |   47 ++++
 drivers/char/Kconfig         |    2 +
 drivers/char/Makefile        |    2 +
 drivers/char/p9auth/Kconfig  |    9 +
 drivers/char/p9auth/Makefile |    1 +
 drivers/char/p9auth/p9auth.c |  517 ++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 578 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/p9auth.txt
 create mode 100644 drivers/char/p9auth/Kconfig
 create mode 100644 drivers/char/p9auth/Makefile
 create mode 100644 drivers/char/p9auth/p9auth.c

diff --git a/Documentation/p9auth.txt b/Documentation/p9auth.txt
new file mode 100644
index 0000000..14a69d8
--- /dev/null
+++ b/Documentation/p9auth.txt
@@ -0,0 +1,47 @@
+The p9auth device driver implements a plan-9 factotum-like
+capability API.  Tasks which are privileged (authorized by
+possession of the CAP_GRANT_ID privilege (POSIX capability))
+can write new capabilities to /dev/caphash.  The kernel then
+stores these until a task uses them by writing to the
+/dev/capuse device.  Each capability represents the ability
+for a task running as userid X to switch to userid Y and
+some set of groups.  Each capability may be used only once,
+and unused capabilities are cleared after two minutes.
+
+The following examples shows how to use the API.  Shell ...
From: Greg KH
Date: Tuesday, April 20, 2010 - 8:04 pm

Hm, you didn't originally write this driver, so it would be good to get
some original authorship information in here to keep everything correct,

Is this code really ready for drivers/char/?  What has changed in it
that makes it ok to move out of the staging tree?


That is incorrect, you don't need the cap_major/minor files at all, the
device node should be automatically created for you, right?

And do you really want to do all of this control through a device node?

One might hope that today would be that day...

Also, please run this through sparse.  This is a variable that doesn't

Do you really need to do it this way?  A cdev for a single device node?

Just always use a dynamic misc device, you never need a whole major for



We have a built-in function for this already.

Oh, this function is also incorrect, which is a good reason to use the
built-in ones.

I'm going to stop now, this doesn't belong in drivers/char/ yet, it
needs work...

thanks,


Hm, wait, one more....  The kernel/user api is going to change some time
in the future?  Please fix this now before it gets merged.
--

From: Serge E. Hallyn
Date: Tuesday, April 20, 2010 - 8:45 pm

That's why I left the MODULE_AUTHOR line in there - not sure what
else to do for that.  I'll add a comment in p9auth.txt, especially

It was dropped from staging :)  I don't particularly care to see it
go back into staging, as opposed to working out issues out of tree
(assuming they are solvable).  For one thing, as you note below,
there is the question of whether it should be a device driver at



Well...

At first I was thinking same as you were.  So I was going to switch
to a pure syscall-based approach.  But it just turned out more
complicated.  The factotum server would call sys_grantid(), and
the target task would end up doing some huge sys_setresugid() or
else multiple syscalls using the granted id.  It just was uglier.
I think there's an experimental patchset sitting somewhere I could
point to (if I weren't embarassed :).

Another possibility would be to use netlink, but that doesn't
appear as amenable to segragation by user namespaces.  The pid
(presumably/hopefully global pid, as __u32) is available, so it
shouldn't be impossible, but a simple device with simple synchronous
read/write certainly has its appeal.  Firing off a message hoping

Well there's nothing actually wrong with the cap_mutex.  I put in
this comment thinking I'd switch to rcu at some point, but it doesn't

Right - will switch that over, assuming we don't nix the whole idea

I haven't asked Ashwin, and will be happy either way.  Ashwin, did

Really I prefer to get rid of that altogether.  Anyone who wants


Huh, no that change already happened.

thanks,
-serge
--

From: Ashwin Ganti
Date: Tuesday, April 20, 2010 - 9:18 pm

Well, since you have been driving a lot of implementation changes
recently, I feel it makes sense for you to be a maintainer. Honestly I
don't think I will be able to find time maintaining this in the future
although I can do the code reviews for you. I will be more than happy
if this feature continues to get traction.

- Ashwin
--

From: Serge E. Hallyn
Date: Wednesday, April 21, 2010 - 6:47 am

Ok.  I'll put myself in MAINTAINERS.  If we stick with the p9auth
driver or filesystem approach, I'd like to keep you in MODULE_AUTHOR()
if that's ok, bc it really is yours, and wouldn't exist without you.

thanks,
-serge
--

From: Ashwin Ganti
Date: Wednesday, April 21, 2010 - 7:44 am

Yeah, sure.

- Ashwin
--

From: Eric W. Biederman
Date: Tuesday, April 20, 2010 - 9:45 pm

pid in the netlink context is the netlink port-id.  It is a very
different concept from struct pid.  These days netlink calls to
the kernel are synchronous, not that I would encourage netlink
for anything except networking code.

Can we make this a trivial filesystem?  I expect that would match
up better with whatever plan9 userspace apps already exist,
remove the inode double translation, and would make it much more
reasonable to do a user namespace aware version  if and when
that becomes necessary.

Eric
--

From: Serge E. Hallyn
Date: Wednesday, April 21, 2010 - 6:21 am

Great idea, and so obvious once you mention it.  I think that's
the way to go.  Thanks, Eric.

-serge
--

From: Serge E. Hallyn
Date: Friday, April 23, 2010 - 8:36 pm

An fs actually seems overkill for two write-only files for
process-related information.  Would these actually be candidates
for new /proc files?

	/proc/grantcred - replaces /dev/caphash, for privileged
		tasks to tell the kernel about new setuid
		capabilities
	/proc/self/usecred - replaces /dev/capuse for unprivileged
		tasks to make use of a setuid capability

-serge
--

From: ron minnich
Date: Saturday, April 24, 2010 - 9:25 am

An fs is fine.

To relate this to Plan 9, where it all began, might be useful. There's
no equivalent in Plan 9 to Linux/Unix devices of the major/minor
number etc. variety. In-kernel drivers and out-of-kernel servers both
end up providing the services (i.e. file name spaces) that we see in a
Linux file system. So the Plan 9 driver for the capability device
really does match closely in function and interface to a Linux
kernel-based file system.

Hence, making devcap a file system is entirely appropriate, because it
best fits the way it works in Plan 9: a kernel driver that provides
two files.

It's pretty easy to write a Linux VFS anyway, so it makes sense from
that point of view.

Eric, that was a great suggestion.

ron
--

From: Eric W. Biederman
Date: Saturday, April 24, 2010 - 11:01 am

A fs provides user space policy control of naming.  I.e. where the two files go.
That can also be a very big deal.  Especially when files are writable.

You have no idea how much I am frustrated by sysfs right now, because
it does not provide userspace policy control and instead mandates a
sometimes inappropriate naming convention.

Eric


--

From: Serge E. Hallyn
Date: Saturday, April 24, 2010 - 8:24 pm

Well I'm not convinced that it's a worthwhile tradeoff for polluting
/proc/filesystems and needing yet another fs mounted in each container,
but a preliminary working version using an fs is at
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/sergeh/linux-cr.git;a=shortlog;h=r...

I'll do some cleanup before sending it out.

Eric, I'd said that the device-based version was namespace-aware, but
that meant that you could on grant and use capabilities in your own
user namespace.  I suppose now that it's an fs we can do better
semantics, where each user ns can mount its own p9auth, and anyone
with CAP_GRANT_ID targeted at some user ns (i.e. root in a user_ns
or the creator of a user_ns) can grant ids to that user ns.  Though
I'm not sure that's a feature anyone would ever use, and I do like
the simplicity of just having one sb.

-serge
--

From: Alan Cox
Date: Wednesday, April 21, 2010 - 2:27 am

Which is a minor back compat issue - but you could start without it and
allow init to add it.

It seems a very complex interface to do a simple thing. A long time ago
there was discussion around extending the AF_UNIX fd passing to permit
'pass handle and auth' so you could send someone a handle with a "become
me" token attached.

Alan
--

From: Serge E. Hallyn
Date: Wednesday, April 21, 2010 - 6:39 am

Hi Alan,

sorry I thought I had cc:d you, bc I was pretty sure you'd have some
neat ideas.  Like this one.

One could try to argue that this makes every linux process susceptible
to a trojan making it grant its userid to other tasks, but of course
that's silly since the trojan could just fork.  Well, what this would
buy the attacker is the ability to sit inconspicuously under his old
userid, holding on to the fd until the admin goes out to coffee before
switching userids.

The other thing is that offhand I think the server can't easily tell
from the socket which user namespace the client is in, as ucred only
has .uid.  Though (1) we might need to create a 'struct puser' analogous
to 'struct pid' for signals anyway, (2) userspace can segragate with
fs or net_ns (if abstract sock), and (3) client in a container
presumably won't be able to authenticate itself to server on the
host anyway.

Ashwin (and Ron), I think this idea will give us the same tools that
the p9auth driver does, perhaps in a more unix-y way.  Would you have
objections, or do you see shortcomings?

Thanks, Alan.

-serge
--

From: Alan Cox
Date: Wednesday, April 21, 2010 - 7:19 am

Possibly, could you put a timestamp in the passed object ? (Given the

That I think needs fixing anyway and now is a good time as any to do it.
If you are going to be able to pass userids then you need to be sure you
don't get passed a handle with a uid change on it that you didn't want so
adding another object type and option of some kind to accept them has to
occur anyway.

That also btw needs fixing for other reasons - more than one daemon has
been written that generically uses recvmsg and so can be attacked with FD
leaks >-)

Alan
--

From: Serge E. Hallyn
Date: Wednesday, April 21, 2010 - 8:09 am

Yup, in fact I do that for the p9auth device driver anyway, to avoid

Ignoring namespaces for a moment, I guess we could do something like

struct credentials_pass {
	pid_t global_pid;
	unsigned long unique_id;
	uid_t new_uid;
	gid_t new_gid;
	int num_aux_gids;
	gid_t aux_gids[];
}

The unique_id is assigned by the kernel;  the client reads the whole
struct credentials_pass - except global_pid - from the unix sock,
verifies the desired credentials, then does sys_acceptid(unique_id)
to accept.

Again, how to pass user namespace information is not obvious to me.

And it's probably debatable whether we want to also pass info about
posix capabilities and LSM security data.

Still, my main concern with this approach is that it makes the server
more complicated.  Since I'm passing along my current credentials, that
means I have to construct just the right credentials, which if using
sock identified by pathname may mean I can no longer write to the
socket, right?

And of course one other important consideration is that the overall
p9auth API has been heavily discussed and documented, whereas
implementing our own entirely new approach is breaking new ground

Yup.

(By 'needs fixing' you just mean needs to be done right for this
service?  Else I think I'm missing something...)

thanks,
-serge
--

From: Eric W. Biederman
Date: Wednesday, April 21, 2010 - 12:15 pm

This looks surprising like what I am doing in passing uids and pids
through unix domain sockets.

So if this looks like a direction we want to go it shouldn't be too

Remember my unix domain socket and the patch for converting struct cred
into a new context, from a month or so ago.  I think that is what we
are talking about.

Eric
--

From: Serge E. Hallyn
Date: Wednesday, April 21, 2010 - 1:23 pm

Zoinks!  After some digging I found it in my containers.mbox and at
https://lists.linux-foundation.org/pipermail/containers/2010-March/023405.html
and see you even called me out.  Sorry!  I see your tree at
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/ebiederm/linux-2.6.33-nsfd-v5.git;...
and commit "af_unix: Allow SO_PEERCRED to work across namespaces", and
it all looks good.  Definately useful for a SO_PASSCRED or somesuch
implementation.

thanks,
-serge
--

From: Kyle Moffett
Date: Wednesday, April 21, 2010 - 9:57 pm

Hmm... for an alternative idea:

We have this nice "kernel keyring" infrastructure that lets us stuff
arbitrary things into "keys" and grant/revoke them between processes.
What if we created a relatively generic way for processes to package
up privileges (of whatever form) into a "key" that could be granted to
another process (via UNIX-domain socket)?  Then the other process
would use a setuid()-ish syscall which would instead apply a specific
key as your credentials, possibly including the audit context and/or
namespaces it came from.

By using the keyring system, such tokens could be kept around across
multiple processes easily (as opposed to FDs), in the same style as a
"sudo" ticket file, for example (even with an expiration time).

Types of credentials you could pass around:
  * Capabilities
  * Filesystem UID/GID in a particular UID namespace (for FS operations)
  * Process UID/GID in a particular UID namespace (for kill(), etc)
  * Audit contexts
  * SELinux/etc security labels

All of the above could be optionally limited to effectively require a
bprm-secure-style exec() with specific args.  So for example, instead
of making "/usr/sbin/passwd" a setuid program, you could make it be an
unprivileged helper.  It would connect to a privileged daemon and ask
for a password-change cookie for that particular user.  The daemon
would create what is essentially a "delayed exec" key which grants a
specific UID and capabilities when that process performs an execkey().

So as an example, you could rewrite "sudo" as a partially-privileged
daemon and an unprivileged helper.  The unpriv helper would send
across a request (optionally including the command and environment)
which would be checked by the daemon.  It would then issue a key to
allow the unpriv helper to perform a limited exec.

Another option would be to rewrite network login programs (eg OpenSSH)
to use this for privilege separation.  The listening process would get
a non-expiring key to allow it to exec a ...
From: Serge E. Hallyn
Date: Thursday, April 22, 2010 - 7:36 am

Hi Kyle,

it's a neat idea in terms of flexibility - but in this case the
flexibility really scares me :)

In particular I don't like the ability to keep these tokens around
for later use, and your example of passing around capabilities is
something we've specifically rejected in the past.

That doesn't mean that it wouldn't be a good idea to take your
general approach, use it very inflexibly, and then slowly, as we
gain experience, add flexibility.

I think we are best off starting with two options:

	1. pass audit credentials only
	2. pass full credentials

credentials meaning uid/gid, though - and let capabilities be

Note that the last two examples are possible with with the simple
p9auth driver, and are really their motivation.  In fact, we only
need a single privileged back-end (factotum) server to server a bunch
of unprivileged front-end servers.

So over the next few days I will incorporate a bunch of the feedback
on the p9auth driver itself, turn it into a filesystem, and resend
it.  That's not to say that I don't like the other suggestions - I
want to pursue those as well and we can see where we end up and which
approach we prefer.  For a first step I look forward to Eric pushing
the patches to do uid and pid translation for unix sock credentials.

thanks,
-serge
--

From: Eric Paris
Date: Wednesday, April 21, 2010 - 6:55 am

If you do go down this path there is a separate (and actually completely
opposite) but related problem I might be able and willing to work with
you on.  When looking at how auditing works in this modern day and age
of dbus+polkit to get background processes to do work on behalf of a
user we were discussing an interface that would pass the information
about the user to the background server process.  The background server
process could do some magic such that it still had all the permissions
and rights of itself, but had the audit information of the original
user.  Thus even though it was a server process with uid=0 that did the
work, the audit logs could know it was actually on behalf of uid=500.  

It was discussed passing that token of audit information over an AF_UNIX
socket.

-Eric

--

From: Serge E. Hallyn
Date: Wednesday, April 21, 2010 - 7:30 am

This actually brings up an issue I've been a bit worried about: is
credentials passing for dbus adequate?  I thought that the last time
I looked through some code, there was no way in particular for upstart
to pass posix capabilities info along.  What that means is that as root
I can do

capsh --drop=(list of all capabilities) --
   reboot

and, although I don't have cap_sys_boot, I can reboot the system.  So
the only way I can prevent a container from rebooting the host is to
start it in a fresh network namespace to segrate the abstract unix
domain sockets.  But if I don't want a fresh network namespace, I'm out
--

From: David Howells
Date: Wednesday, April 21, 2010 - 3:46 am

Can you mark these extern please?

Other than that, these functions should probably be mentioned in
Documentation/credentials.txt.

David
--

From: Serge E. Hallyn
Date: Wednesday, April 21, 2010 - 6:40 am

Will do.

thanks,
-serge
--

From: David Howells
Date: Wednesday, April 21, 2010 - 3:49 am

If you make this:

	if (ret == 0)
		return commit_creds(new);
	abort_creds(new);
	return ret;

then gcc can tail-call commit_creds(), which is guaranteed to return 0.

David
--

From: Serge E. Hallyn
Date: Wednesday, April 21, 2010 - 6:40 am

Ok, will do.

thanks,
-serge
--

Previous thread: Question on 2.6.25 kernel crash. by Viswanathan Murugesan on Tuesday, April 20, 2010 - 6:21 pm. (2 messages)

Next thread: [Patch 1/1] init: Provide a kernel start parameter to increase pid_max by Mike Travis on Tuesday, April 20, 2010 - 6:40 pm. (32 messages)