Following is the next version of the user namespace patchset. The core of the set is patch 2, originally conceived of and implemented by Eric Biederman. The concept is to target capabilities at user namespaces. A task's capabilities are now contextualized as follows (previously, capabilities had no context): 1. For a task in the initial user namespace, the calculated capabilities (pi, pe, pp) are available to act upon any user namespace. 2. For a task in a child user namespace, the calculated capabilities are available to act only on its own or any descendent user namespace. It has no capabilities to any parent or unrelated user namespaces. 3. If a user A creates a new user namespace, that user has all capabilities to that new user namespace and any of its descendents. (Contrast this with a user namespace created by another user B in the same user namespace, to which this user A has only his calculated capabilities) All existing 'capable' checks are automatically converted to checks against the initial user namespace. This means that by default, root in a child user namespace is powerless. Patches 3-5 begin to enable capabilities in child user namespaces to set hostnames, kill tasks, and do ptrace. My near-term next goals will be to enable setuid and setgid, and to provide a way for the filesystem to be usable in child user namespaces. At the very least I'd like a fresh loopback or LVM mount and proc mounts to be supported. thanks, -serge --
copy_process() handles CLONE_NEWUSER before the rest of the
namespaces. So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS)
the new uts namespace will have the new user namespace as its
owner. That is what we want, since we want root in that new
userns to be able to have privilege over it.
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
---
include/linux/utsname.h | 3 +++
init/version.c | 2 ++
kernel/nsproxy.c | 3 +++
kernel/user.c | 8 ++++++--
kernel/utsname.c | 4 ++++
5 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 69f3997..85171be 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -37,9 +37,12 @@ struct new_utsname {
#include <linux/nsproxy.h>
#include <linux/err.h>
+struct user_namespace;
+
struct uts_namespace {
struct kref kref;
struct new_utsname name;
+ struct user_namespace *user_ns;
};
extern struct uts_namespace init_uts_ns;
diff --git a/init/version.c b/init/version.c
index 79fb8c2..9eb19fb 100644
--- a/init/version.c
+++ b/init/version.c
@@ -21,6 +21,7 @@ extern int version_string(LINUX_VERSION_CODE);
int version_string(LINUX_VERSION_CODE);
#endif
+extern struct user_namespace init_user_ns;
struct uts_namespace init_uts_ns = {
.kref = {
.refcount = ATOMIC_INIT(2),
@@ -33,6 +34,7 @@ struct uts_namespace init_uts_ns = {
.machine = UTS_MACHINE,
.domainname = UTS_DOMAINNAME,
},
+ .user_ns = &init_user_ns,
};
EXPORT_SYMBOL_GPL(init_uts_ns);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f74e6c0..5a22dcf 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -74,6 +74,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
err = PTR_ERR(new_nsp->uts_ns);
goto out_uts;
}
+ put_user_ns(new_nsp->uts_ns->user_ns);
+ new_nsp->uts_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
+ get_user_ns(new_nsp->uts_ns->user_ns);
...- Introduce ns_capable to test for a capability in a non-default user namespace. - Teach cap_capable to handle capabilities in a non-default user namespace. The motivation is to get to the unprivileged creation of new namespaces. It looks like this gets us 90% of the way there, with only potential uid confusion issues left. I still need to handle getting all caps after creation but otherwise I think I have a good starter patch that achieves all of your goals. Changelog: 11/05/2010: [serge] add apparmor 12/14/2010: [serge] fix capabilities to created user namespaces Without this, if user serge creates a user_ns, he won't have capabilities to the user_ns he created. THis is because we were first checking whether his effective caps had the caps he needed and returning -EPERM if not, and THEN checking whether he was the creator. Reverse those checks. 12/16/2010: [serge] security_real_capable needs ns argument in !security case Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> --- include/linux/capability.h | 7 +++++-- include/linux/security.h | 22 ++++++++++++---------- kernel/capability.c | 22 ++++++++++++++++++++-- security/apparmor/lsm.c | 5 +++-- security/commoncap.c | 40 +++++++++++++++++++++++++++++++++------- security/security.c | 12 ++++++------ security/selinux/hooks.c | 14 +++++++++----- 7 files changed, 88 insertions(+), 34 deletions(-) diff --git a/include/linux/capability.h b/include/linux/capability.h index 90012b9..cc3e976 100644 --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -541,7 +541,7 @@ extern const kernel_cap_t __cap_init_eff_set; * * Note that this does not set PF_SUPERPRIV on the task. */ -#define has_capability(t, cap) (security_real_capable((t), (cap)) == 0) +#define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0) /** * has_capability_noaudit - ...
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> --- kernel/sys.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/sys.c b/kernel/sys.c index 2745dcd..9b9b03b 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1171,7 +1171,7 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len) int errno; char tmp[__NEW_UTS_LEN]; - if (!capable(CAP_SYS_ADMIN)) + if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN)) return -EPERM; if (len < 0 || len > __NEW_UTS_LEN) return -EINVAL; -- 1.7.0.4 --
Changelog:
Dec 8: Fixed bug in my check_kill_permission pointed out by
Eric Biederman.
Dec 13: Apply Eric's suggestion to pass target task into kill_ok_by_cred()
for clarity
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
---
kernel/signal.c | 33 ++++++++++++++++++++++++++++-----
1 files changed, 28 insertions(+), 5 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index 4e3cff1..499bd36 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -636,6 +636,33 @@ static inline bool si_fromuser(const struct siginfo *info)
}
/*
+ * called with RCU read lock from check_kill_permission()
+ */
+static inline int kill_ok_by_cred(struct task_struct *t)
+{
+ struct cred *cred = current_cred();
+ struct cred *tcred = __task_cred(t);
+
+ if (cred->user->user_ns != tcred->user->user_ns) {
+ /* userids are not equivalent - either you have the
+ capability to the target user ns or you don't */
+ if (ns_capable(tcred->user->user_ns, CAP_KILL))
+ return 1;
+ return 0;
+ }
+
+ /* same user namespace - usual credentials checks apply */
+ if ((cred->euid ^ tcred->suid) &&
+ (cred->euid ^ tcred->uid) &&
+ (cred->uid ^ tcred->suid) &&
+ (cred->uid ^ tcred->uid) &&
+ !ns_capable(tcred->user->user_ns, CAP_KILL))
+ return 0;
+
+ return 1;
+}
+
+/*
* Bad permissions for sending the signal
* - the caller must hold the RCU read lock
*/
@@ -659,11 +686,7 @@ static int check_kill_permission(int sig, struct siginfo *info,
cred = current_cred();
tcred = __task_cred(t);
if (!same_thread_group(current, t) &&
- (cred->euid ^ tcred->suid) &&
- (cred->euid ^ tcred->uid) &&
- (cred->uid ^ tcred->suid) &&
- (cred->uid ^ tcred->uid) &&
- !capable(CAP_KILL)) {
+ !kill_ok_by_cred(t)) {
switch (sig) {
case SIGCONT:
sid = task_session(t);
--
1.7.0.4
--
ptrace is allowed to tasks in the same user namespace according to
the usual rules (i.e. the same rules as for two tasks in the init
user namespace). ptrace is also allowed to a user namespace to
which the current task the has CAP_SYS_PTRACE capability.
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
---
include/linux/capability.h | 2 ++
kernel/ptrace.c | 40 ++++++++++++++++++++++++++++------------
security/commoncap.c | 26 +++++++++++++++++++++-----
3 files changed, 51 insertions(+), 17 deletions(-)
diff --git a/include/linux/capability.h b/include/linux/capability.h
index cc3e976..777a166 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -543,6 +543,8 @@ extern const kernel_cap_t __cap_init_eff_set;
*/
#define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
+#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0)
+
/**
* has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
* @t: The task in question
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 99bbaa3..aed24eb 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -116,6 +116,19 @@ int ptrace_check_attach(struct task_struct *child, int kill)
return ret;
}
+static inline int may_ptrace_ns(struct task_struct *t)
+{
+ struct user_namespace *ns;
+ int ret;
+
+ rcu_read_lock();
+ ns = task_cred_xxx(t, user)->user_ns;
+ ret = ns_capable(ns, CAP_SYS_PTRACE);
+ rcu_read_unlock();
+
+ return ret;
+}
+
int __ptrace_may_access(struct task_struct *task, unsigned int mode)
{
const struct cred *cred = current_cred(), *tcred;
@@ -134,21 +147,24 @@ int __ptrace_may_access(struct task_struct *task, unsigned int mode)
return 0;
rcu_read_lock();
tcred = __task_cred(task);
- if ((cred->uid != tcred->euid ||
- cred->uid != tcred->suid ||
- cred->uid != tcred->uid ||
- cred->gid != tcred->egid ||
- ...The uid equality check below is broken. Can we name this ptrace_capable? Since you are only wrapping the capability check? With a name like may_ptrace_ns This probably deserves a comment about why cap_issubset isn't needed here. Aka we implicitly have all caps in child user namespaces --
Thanks for reviewing, Eric. Hm, the whole structure here could probably stand to be improved anyway. I just can't quite think how. I'll rename it as you Hm, I started to explain why it doesn't, but you're right. If any of the uids are different, then you must have (going strictly by the rules which fall out from the original intent of ns_capable) : There is a case where that isn't true - if I'm user B in userns 3, and user A in userns 3 created the userns 4 in which this target task, owned by user C, sits. Then user B does not have all capabilities to userns 4, but any calculated capabilities which B has, are also valid in userns 4. I'd still claim that capabilities aren't really comparable (because they are targeted at different user namespaces), and therefore the CAP_SYS_PTRACE should be sufficient for this case. But maybe that's not as practical. Maybe the cap_issubset check should be there after --
Thanks for the review, Eric. Updated version appended. Assuming there are no big problems with this version, I hope to do setuid/setgid and start the simplest vfs access controls next. Subject: [PATCH 5/5] Allow ptrace from non-init user namespaces ptrace is allowed to tasks in the same user namespace according to the usual rules (i.e. the same rules as for two tasks in the init user namespace). ptrace is also allowed to a user namespace to which the current task the has CAP_SYS_PTRACE capability. Changelog: Dec 31: Address feedback by Eric: . Correct ptrace uid check . Rename may_ptrace_ns to ptrace_capable . Also fix the cap_ptrace checks. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> --- include/linux/capability.h | 2 + include/linux/user_namespace.h | 9 +++++++ kernel/ptrace.c | 40 +++++++++++++++++++++++---------- kernel/user_namespace.c | 16 +++++++++++++ security/commoncap.c | 48 +++++++++++++++++++++++++++++++++------ 5 files changed, 95 insertions(+), 20 deletions(-) diff --git a/include/linux/capability.h b/include/linux/capability.h index cc3e976..777a166 100644 --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -543,6 +543,8 @@ extern const kernel_cap_t __cap_init_eff_set; */ #define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0) +#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0) + /** * has_capability_noaudit - Determine if a task has a superior capability available (unaudited) * @t: The task in question diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index 8178156..91c4f10 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -39,6 +39,9 @@ static inline void put_user_ns(struct user_namespace *ns) uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid); gid_t user_ns_map_gid(struct ...
Just to make sure I understand right: you mean wait until after the --
I mean cred and tcred are only use in kill_ok_by_cred. Eric --
D'oh. Should've looked at the original tree, not the context. Got it, thanks. -serge --
Thanks for the review. Here is an updated version.
Subject: [PATCH 4/5] allow killing tasks in your own or child userns
Changelog:
Dec 8: Fixed bug in my check_kill_permission pointed out by
Eric Biederman.
Dec 13: Apply Eric's suggestion to pass target task into kill_ok_by_cred()
for clarity
Dec 31: address comment by Eric Biederman:
don't need cred/tcred in check_kill_permission.
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
---
kernel/signal.c | 36 ++++++++++++++++++++++++++++--------
1 files changed, 28 insertions(+), 8 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index 4e3cff1..d890c99 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -636,13 +636,39 @@ static inline bool si_fromuser(const struct siginfo *info)
}
/*
+ * called with RCU read lock from check_kill_permission()
+ */
+static inline int kill_ok_by_cred(struct task_struct *t)
+{
+ struct cred *cred = current_cred();
+ struct cred *tcred = __task_cred(t);
+
+ if (cred->user->user_ns != tcred->user->user_ns) {
+ /* userids are not equivalent - either you have the
+ capability to the target user ns or you don't */
+ if (ns_capable(tcred->user->user_ns, CAP_KILL))
+ return 1;
+ return 0;
+ }
+
+ /* same user namespace - usual credentials checks apply */
+ if ((cred->euid ^ tcred->suid) &&
+ (cred->euid ^ tcred->uid) &&
+ (cred->uid ^ tcred->suid) &&
+ (cred->uid ^ tcred->uid) &&
+ !ns_capable(tcred->user->user_ns, CAP_KILL))
+ return 0;
+
+ return 1;
+}
+
+/*
* Bad permissions for sending the signal
* - the caller must hold the RCU read lock
*/
static int check_kill_permission(int sig, struct siginfo *info,
struct task_struct *t)
{
- const struct cred *cred, *tcred;
struct pid *sid;
int error;
@@ -656,14 +682,8 @@ static int check_kill_permission(int sig, struct siginfo *info,
if (error)
return error;
- cred = current_cred();
- tcred = __task_cred(t);
if ...(Note, here and elsewhere these should have been const. My next posting, with some new patches, will have that fix) thanks, -serge --
Err, no. +1 is to ensure it's never freed, but since allocator will BUG_ON (?) if fed with static object, maybe it's completely bogus. --
Right. On the one hand so long as it's high enough it doesn't really matter. But I think it's worth getting the # exactly right in order to help document what's going on. So I'll try dropping it back down to 2. thanks, -serge --
Hmm, actually user_namespace isn't in struct nsproxy any more, since 18b6e0414e42d95183f07d8177e3ff0241abd825. I think my original thought was that init_user_ns is pinned twice by root_user - once as creator and once as the root user in it. In which case it isn't right - the user_ns pins the task which created it, the creator does not pin the user_ns it creates. -serge --
Wait, WTF? You have a static kref and you try to automatically instanciate it here? As it's static, why are you even having a kref at all, what good does it do you, you can't delete the thing, it's always around, so just remove it entirely please. Or, dynamically create it properly. In other words, this is majorly broken. thanks, greg k-h --
There is a very weird case for the data structures the initial task has references to. The initial task never goes away and so those data structure never go away. Furthermore we need many of those data structures before we have a memory allocator ready. So we statically allocate a single data structure and up it's reference count to ensure that the count never goes to zero. There are also major benefits to have the version of something that is never freed never going away, because it means you can just reference it in code. So while I would be happy to say this is special don't use a kref and roll the reference counting logic by hand, we aren't dynamically allocating init_uts_ns any time soon. Eric --
Why not just dynamically create this structure once and then, if what you say is really true, you never have to worry about the structure Why have a reference count at all if it's not needed or used here? confused, greg k-h --
We have to reference count every other uts namespace. Eric --
Ok, that makes sense, then also please dynamically create this one, do not create a static kref. thanks, greg k-h --
Nope. It's a bad idea. It messes up the kernel bootstrap if you do that, and it makes this one structure different from every other structure init_task uses. Eric --
{sigh}
Ok, but I really don't like this use.
Also, don't go messing with that ATOMIC_INIT() to be a higher value, as
this patch series did, as that really implies that it is being used
incorrectly, right?
thanks,
greg k-h
--
Can't delete this one, but can delete all the uts namespaces, obviously. As with init_tgcred in kernel/cred.c. It's initialized with a refcount which will keep it from ever getting If we create it dynamically, then I don't think we can use it the way we do in kernel/utsname_sysctl.c for instance. thanks, -serge --
That's my point. You are getting "lucky" by creating a static kref. Why not? It's just a pointer to the structure instead of the structure itself, right? thanks, greg k-h --
