Re: [RFC 4/5] user namespaces: allow killing tasks in your own or child userns

Previous thread: [GIT PULL] KVM updates fot 2.6.37-rc6 by Avi Kivity on Friday, December 17, 2010 - 8:17 am. (1 message)

Next thread: Add device tree support for x86, v2 by Sebastian Andrzej Siewior on Friday, December 17, 2010 - 8:33 am. (33 messages)
From: Serge E. Hallyn
Date: Friday, December 17, 2010 - 8:22 am

Following is the next version of the user namespace patchset.

The core of the set is patch 2, originally conceived of and
implemented by Eric Biederman.  The concept is to target
capabilities at user namespaces.  A task's capabilities are
now contextualized as follows (previously, capabilities had
no context):

1. For a task in the initial user namespace, the calculated
capabilities (pi, pe, pp) are available to act upon any
user namespace.

2. For a task in a child user namespace, the calculated
capabilities are available to act only on its own or any
descendent user namespace.  It has no capabilities to any
parent or unrelated user namespaces.

3. If a user A creates a new user namespace, that user has
all capabilities to that new user namespace and any of its
descendents.  (Contrast this with a user namespace created
by another user B in the same user namespace, to which this
user A has only his calculated capabilities)

All existing 'capable' checks are automatically converted to
checks against the initial user namespace.  This means that
by default, root in a child user namespace is powerless.
Patches 3-5 begin to enable capabilities in child user
namespaces to set hostnames, kill tasks, and do ptrace.

My near-term next goals will be to enable setuid and setgid,
and to provide a way for the filesystem to be usable in child
user namespaces.  At the very least I'd like a fresh loopback
or LVM mount and proc mounts to be supported.

thanks,
-serge
--

From: Serge E. Hallyn
Date: Friday, December 17, 2010 - 8:24 am

copy_process() handles CLONE_NEWUSER before the rest of the
namespaces.  So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS)
the new uts namespace will have the new user namespace as its
owner.  That is what we want, since we want root in that new
userns to be able to have privilege over it.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
---
 include/linux/utsname.h |    3 +++
 init/version.c          |    2 ++
 kernel/nsproxy.c        |    3 +++
 kernel/user.c           |    8 ++++++--
 kernel/utsname.c        |    4 ++++
 5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 69f3997..85171be 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -37,9 +37,12 @@ struct new_utsname {
 #include <linux/nsproxy.h>
 #include <linux/err.h>
 
+struct user_namespace;
+
 struct uts_namespace {
 	struct kref kref;
 	struct new_utsname name;
+	struct user_namespace *user_ns;
 };
 extern struct uts_namespace init_uts_ns;
 
diff --git a/init/version.c b/init/version.c
index 79fb8c2..9eb19fb 100644
--- a/init/version.c
+++ b/init/version.c
@@ -21,6 +21,7 @@ extern int version_string(LINUX_VERSION_CODE);
 int version_string(LINUX_VERSION_CODE);
 #endif
 
+extern struct user_namespace init_user_ns;
 struct uts_namespace init_uts_ns = {
 	.kref = {
 		.refcount	= ATOMIC_INIT(2),
@@ -33,6 +34,7 @@ struct uts_namespace init_uts_ns = {
 		.machine	= UTS_MACHINE,
 		.domainname	= UTS_DOMAINNAME,
 	},
+	.user_ns = &init_user_ns,
 };
 EXPORT_SYMBOL_GPL(init_uts_ns);
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f74e6c0..5a22dcf 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -74,6 +74,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		err = PTR_ERR(new_nsp->uts_ns);
 		goto out_uts;
 	}
+	put_user_ns(new_nsp->uts_ns->user_ns);
+	new_nsp->uts_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
+	get_user_ns(new_nsp->uts_ns->user_ns);
 
 ...
From: Serge E. Hallyn
Date: Friday, December 17, 2010 - 8:25 am

- Introduce ns_capable to test for a capability in a non-default
  user namespace.
- Teach cap_capable to handle capabilities in a non-default
  user namespace.

The motivation is to get to the unprivileged creation of new
namespaces.  It looks like this gets us 90% of the way there, with
only potential uid confusion issues left.

I still need to handle getting all caps after creation but otherwise I
think I have a good starter patch that achieves all of your goals.

Changelog:
	11/05/2010: [serge] add apparmor
	12/14/2010: [serge] fix capabilities to created user namespaces
	Without this, if user serge creates a user_ns, he won't have
	capabilities to the user_ns he created.  THis is because we
	were first checking whether his effective caps had the caps
	he needed and returning -EPERM if not, and THEN checking whether
	he was the creator.  Reverse those checks.
	12/16/2010: [serge] security_real_capable needs ns argument in !security case

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
---
 include/linux/capability.h |    7 +++++--
 include/linux/security.h   |   22 ++++++++++++----------
 kernel/capability.c        |   22 ++++++++++++++++++++--
 security/apparmor/lsm.c    |    5 +++--
 security/commoncap.c       |   40 +++++++++++++++++++++++++++++++++-------
 security/security.c        |   12 ++++++------
 security/selinux/hooks.c   |   14 +++++++++-----
 7 files changed, 88 insertions(+), 34 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 90012b9..cc3e976 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -541,7 +541,7 @@ extern const kernel_cap_t __cap_init_eff_set;
  *
  * Note that this does not set PF_SUPERPRIV on the task.
  */
-#define has_capability(t, cap) (security_real_capable((t), (cap)) == 0)
+#define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
 
 /**
  * has_capability_noaudit - ...
From: Serge E. Hallyn
Date: Friday, December 17, 2010 - 8:26 am

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
---
 kernel/sys.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 2745dcd..9b9b03b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1171,7 +1171,7 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
 	int errno;
 	char tmp[__NEW_UTS_LEN];
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 	if (len < 0 || len > __NEW_UTS_LEN)
 		return -EINVAL;
-- 
1.7.0.4

--

From: Serge E. Hallyn
Date: Friday, December 17, 2010 - 8:26 am

Changelog:
	Dec 8: Fixed bug in my check_kill_permission pointed out by
	       Eric Biederman.
	Dec 13: Apply Eric's suggestion to pass target task into kill_ok_by_cred()
	        for clarity

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
---
 kernel/signal.c |   33 ++++++++++++++++++++++++++++-----
 1 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 4e3cff1..499bd36 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -636,6 +636,33 @@ static inline bool si_fromuser(const struct siginfo *info)
 }
 
 /*
+ * called with RCU read lock from check_kill_permission()
+ */
+static inline int kill_ok_by_cred(struct task_struct *t)
+{
+	struct cred *cred = current_cred();
+	struct cred *tcred = __task_cred(t);
+
+	if (cred->user->user_ns != tcred->user->user_ns) {
+		/* userids are not equivalent - either you have the
+		   capability to the target user ns or you don't */
+		if (ns_capable(tcred->user->user_ns, CAP_KILL))
+			return 1;
+		return 0;
+	}
+
+	/* same user namespace - usual credentials checks apply */
+	if ((cred->euid ^ tcred->suid) &&
+	    (cred->euid ^ tcred->uid) &&
+	    (cred->uid  ^ tcred->suid) &&
+	    (cred->uid  ^ tcred->uid) &&
+	    !ns_capable(tcred->user->user_ns, CAP_KILL))
+		return 0;
+
+	return 1;
+}
+
+/*
  * Bad permissions for sending the signal
  * - the caller must hold the RCU read lock
  */
@@ -659,11 +686,7 @@ static int check_kill_permission(int sig, struct siginfo *info,
 	cred = current_cred();
 	tcred = __task_cred(t);
 	if (!same_thread_group(current, t) &&
-	    (cred->euid ^ tcred->suid) &&
-	    (cred->euid ^ tcred->uid) &&
-	    (cred->uid  ^ tcred->suid) &&
-	    (cred->uid  ^ tcred->uid) &&
-	    !capable(CAP_KILL)) {
+	    !kill_ok_by_cred(t)) {
 		switch (sig) {
 		case SIGCONT:
 			sid = task_session(t);
-- 
1.7.0.4

--

From: Serge E. Hallyn
Date: Friday, December 17, 2010 - 8:27 am

ptrace is allowed to tasks in the same user namespace according to
the usual rules (i.e. the same rules as for two tasks in the init
user namespace).  ptrace is also allowed to a user namespace to
which the current task the has CAP_SYS_PTRACE capability.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
---
 include/linux/capability.h |    2 ++
 kernel/ptrace.c            |   40 ++++++++++++++++++++++++++++------------
 security/commoncap.c       |   26 +++++++++++++++++++++-----
 3 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index cc3e976..777a166 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -543,6 +543,8 @@ extern const kernel_cap_t __cap_init_eff_set;
  */
 #define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
 
+#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0)
+
 /**
  * has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
  * @t: The task in question
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 99bbaa3..aed24eb 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -116,6 +116,19 @@ int ptrace_check_attach(struct task_struct *child, int kill)
 	return ret;
 }
 
+static inline int may_ptrace_ns(struct task_struct *t)
+{
+	struct user_namespace *ns;
+	int ret;
+
+	rcu_read_lock();
+	ns = task_cred_xxx(t, user)->user_ns;
+	ret = ns_capable(ns, CAP_SYS_PTRACE);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 {
 	const struct cred *cred = current_cred(), *tcred;
@@ -134,21 +147,24 @@ int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 		return 0;
 	rcu_read_lock();
 	tcred = __task_cred(task);
-	if ((cred->uid != tcred->euid ||
-	     cred->uid != tcred->suid ||
-	     cred->uid != tcred->uid  ||
-	     cred->gid != tcred->egid ||
-	     ...
From: Eric W. Biederman
Date: Friday, December 17, 2010 - 12:45 pm

The uid equality check below is broken.


Can we name this ptrace_capable?  Since you are only
wrapping the capability check?  With a name like may_ptrace_ns


This probably deserves a comment about why cap_issubset isn't
needed here.  Aka we implicitly have all caps in child user namespaces
--

From: Serge Hallyn
Date: Friday, December 17, 2010 - 1:04 pm

Thanks for reviewing, Eric.


Hm, the whole structure here could probably stand to be improved
anyway.  I just can't quite think how.  I'll rename it as you

Hm, I started to explain why it doesn't, but you're right.
If any of the uids are different, then you must have

(going strictly by the rules which fall out from the original intent
of ns_capable) :

There is a case where that isn't true - if I'm user B in userns 3, and
user A in userns 3 created the userns 4 in which this target task, owned
by user C, sits.  Then user B does not have all capabilities to userns 4,
but any calculated capabilities which B has, are also valid in userns 4.

I'd still claim that capabilities aren't really comparable (because
they are targeted at different user namespaces), and therefore the
CAP_SYS_PTRACE should be sufficient for this case.  But maybe that's
not as practical.  Maybe the cap_issubset check should be there after
--

From: Serge E. Hallyn
Date: Friday, December 31, 2010 - 9:47 pm

Thanks for the review, Eric.  Updated version appended.  Assuming there
are no big problems with this version, I hope to do setuid/setgid and
start the simplest vfs access controls next.

Subject: [PATCH 5/5] Allow ptrace from non-init user namespaces

ptrace is allowed to tasks in the same user namespace according to
the usual rules (i.e. the same rules as for two tasks in the init
user namespace).  ptrace is also allowed to a user namespace to
which the current task the has CAP_SYS_PTRACE capability.

Changelog:
	Dec 31: Address feedback by Eric:
		. Correct ptrace uid check
		. Rename may_ptrace_ns to ptrace_capable
		. Also fix the cap_ptrace checks.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
---
 include/linux/capability.h     |    2 +
 include/linux/user_namespace.h |    9 +++++++
 kernel/ptrace.c                |   40 +++++++++++++++++++++++----------
 kernel/user_namespace.c        |   16 +++++++++++++
 security/commoncap.c           |   48 +++++++++++++++++++++++++++++++++------
 5 files changed, 95 insertions(+), 20 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index cc3e976..777a166 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -543,6 +543,8 @@ extern const kernel_cap_t __cap_init_eff_set;
  */
 #define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
 
+#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0)
+
 /**
  * has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
  * @t: The task in question
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 8178156..91c4f10 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -39,6 +39,9 @@ static inline void put_user_ns(struct user_namespace *ns)
 uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid);
 gid_t user_ns_map_gid(struct ...
From: Eric W. Biederman
Date: Friday, December 17, 2010 - 12:31 pm

From: Serge E. Hallyn
Date: Friday, December 17, 2010 - 1:09 pm

Just to make sure I understand right: you mean wait until after the
--

From: Eric W. Biederman
Date: Friday, December 17, 2010 - 1:17 pm

I mean cred and tcred are only use in kill_ok_by_cred.


Eric
--

From: Serge E. Hallyn
Date: Friday, December 17, 2010 - 1:22 pm

D'oh.  Should've looked at the original tree, not the context.  Got it,
thanks.

-serge
--

From: Serge E. Hallyn
Date: Friday, December 31, 2010 - 9:45 pm

Thanks for the review.  Here is an updated version.

Subject: [PATCH 4/5] allow killing tasks in your own or child userns

Changelog:
	Dec 8: Fixed bug in my check_kill_permission pointed out by
	       Eric Biederman.
	Dec 13: Apply Eric's suggestion to pass target task into kill_ok_by_cred()
	        for clarity
	Dec 31: address comment by Eric Biederman:
		don't need cred/tcred in check_kill_permission.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
---
 kernel/signal.c |   36 ++++++++++++++++++++++++++++--------
 1 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 4e3cff1..d890c99 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -636,13 +636,39 @@ static inline bool si_fromuser(const struct siginfo *info)
 }
 
 /*
+ * called with RCU read lock from check_kill_permission()
+ */
+static inline int kill_ok_by_cred(struct task_struct *t)
+{
+	struct cred *cred = current_cred();
+	struct cred *tcred = __task_cred(t);
+
+	if (cred->user->user_ns != tcred->user->user_ns) {
+		/* userids are not equivalent - either you have the
+		   capability to the target user ns or you don't */
+		if (ns_capable(tcred->user->user_ns, CAP_KILL))
+			return 1;
+		return 0;
+	}
+
+	/* same user namespace - usual credentials checks apply */
+	if ((cred->euid ^ tcred->suid) &&
+	    (cred->euid ^ tcred->uid) &&
+	    (cred->uid  ^ tcred->suid) &&
+	    (cred->uid  ^ tcred->uid) &&
+	    !ns_capable(tcred->user->user_ns, CAP_KILL))
+		return 0;
+
+	return 1;
+}
+
+/*
  * Bad permissions for sending the signal
  * - the caller must hold the RCU read lock
  */
 static int check_kill_permission(int sig, struct siginfo *info,
 				 struct task_struct *t)
 {
-	const struct cred *cred, *tcred;
 	struct pid *sid;
 	int error;
 
@@ -656,14 +682,8 @@ static int check_kill_permission(int sig, struct siginfo *info,
 	if (error)
 		return error;
 
-	cred = current_cred();
-	tcred = __task_cred(t);
 	if ...
From: Eric W. Biederman
Date: Saturday, January 1, 2011 - 4:10 pm

This one looks good.

--

From: Serge E. Hallyn
Date: Sunday, January 2, 2011 - 7:39 am

(Note, here and elsewhere these should have been const.  My next posting,
with some new patches, will have that fix)

thanks,
-serge
--

From: Alexey Dobriyan
Date: Friday, December 17, 2010 - 8:56 am

+1 is for init_nsproxy ;-)
--

From: Alexey Dobriyan
Date: Friday, December 17, 2010 - 9:00 am

Err, no.

+1 is to ensure it's never freed, but since allocator will BUG_ON (?) if fed
with static object, maybe it's completely bogus.
--

From: Serge Hallyn
Date: Friday, December 17, 2010 - 9:17 am

Right.  On the one hand so long as it's high enough it doesn't really matter.
But I think it's worth getting the # exactly right in order to help document
what's going on.  So I'll try dropping it back down to 2.

thanks,
-serge
--

From: Serge Hallyn
Date: Friday, December 17, 2010 - 9:12 am

Hmm, actually user_namespace isn't in struct nsproxy any more,
since 18b6e0414e42d95183f07d8177e3ff0241abd825.

I think my original thought was that init_user_ns is pinned twice by
root_user - once as creator and once as the root user in it.  In
which case it isn't right - the user_ns pins the task which created
it, the creator does not pin the user_ns it creates.

-serge
--

From: Greg KH
Date: Friday, December 17, 2010 - 10:31 am

Wait, WTF?

You have a static kref and you try to automatically instanciate it here?
As it's static, why are you even having a kref at all, what good does it
do you, you can't delete the thing, it's always around, so just remove
it entirely please.

Or, dynamically create it properly.  In other words, this is majorly
broken.

thanks,

greg k-h
--

From: Eric W. Biederman
Date: Friday, December 17, 2010 - 12:26 pm

There is a very weird case for the data structures the initial task has
references to.  The initial task never goes away and so those data
structure never go away.  Furthermore we need many of those data
structures before we have a memory allocator ready.  So we statically
allocate a single data structure and up it's reference count to ensure
that the count never goes to zero.

There are also major benefits to have the version of something that is
never freed never going away, because it means you can just reference it
in code.  So while I would be happy to say this is special don't use a
kref and roll the reference counting logic by hand, we aren't
dynamically allocating init_uts_ns any time soon.

Eric
--

From: Greg KH
Date: Friday, December 17, 2010 - 12:58 pm

Why not just dynamically create this structure once and then, if what
you say is really true, you never have to worry about the structure

Why have a reference count at all if it's not needed or used here?

confused,

greg k-h
--

From: Eric W. Biederman
Date: Friday, December 17, 2010 - 1:40 pm

We have to reference count every other uts namespace.

Eric
--

From: Greg KH
Date: Friday, December 17, 2010 - 4:15 pm

Ok, that makes sense, then also please dynamically create this one, do
not create a static kref.

thanks,

greg k-h
--

From: Eric W. Biederman
Date: Friday, December 17, 2010 - 11:32 pm

Nope.  It's a bad idea.  It messes up the kernel bootstrap if you do
that, and it makes this one structure different from every other
structure init_task uses.

Eric
--

From: Greg KH
Date: Saturday, December 18, 2010 - 10:56 am

{sigh}

Ok, but I really don't like this use.

Also, don't go messing with that ATOMIC_INIT() to be a higher value, as
this patch series did, as that really implies that it is being used
incorrectly, right?

thanks,

greg k-h
--

From: Serge Hallyn
Date: Friday, December 17, 2010 - 12:46 pm

Can't delete this one, but can delete all the uts namespaces, obviously.
As with init_tgcred in kernel/cred.c.

It's initialized with a refcount which will keep it from ever getting

If we create it dynamically, then I don't think we can use it the way we
do in kernel/utsname_sysctl.c for instance.

thanks,
-serge
--

From: Greg KH
Date: Friday, December 17, 2010 - 12:57 pm

That's my point.  You are getting "lucky" by creating a static kref.

Why not?  It's just a pointer to the structure instead of the structure
itself, right?

thanks,

greg k-h
--

Previous thread: [GIT PULL] KVM updates fot 2.6.37-rc6 by Avi Kivity on Friday, December 17, 2010 - 8:17 am. (1 message)

Next thread: Add device tree support for x86, v2 by Sebastian Andrzej Siewior on Friday, December 17, 2010 - 8:33 am. (33 messages)