The following 3 patches are a small start to the long task of fleshing
out user namespace support. These three patches just clear up the
relationship of struct user_struct to struct user_namespace (patch 1),
fix up the refcounting (patch 2), and complete the action of switching
users when cloning a new user_namespace (patch 3).
Patch 1:
When a task does clone(CLONE_NEWUSER), the task's user is the 'creator' of the
new user_namespace, and the user_namespace is tacked onto a list of those
created by this user.
Changelog:
Aug 25: make free_user not inlined as it's not trivial. (Eric
Biederman suggestion)
Aug 1: renamed user->user_namespace to user_ns, as the next
patch did anyway.
Aug 1: move put_user_ns call in one free_user() definition
to move it outside the lock in free_user. put_user_ns
calls free_user on the user_ns->creator, which in
turn would grab the lock again.
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
---
include/linux/sched.h | 1 +
include/linux/user_namespace.h | 1 +
kernel/user.c | 11 +++++++++--
kernel/user_namespace.c | 20 +++++++++++---------
4 files changed, 22 insertions(+), 11 deletions(-)
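
As a rough illustration of the lock-ordering point in the second Aug 1
changelog entry, here is a minimal userspace sketch. It is not kernel
code: toy_lock, toy_ns, toy_user and every function below are
hypothetical stand-ins, and the "lock" is just a counter that records
when it is taken while already held (where a real non-recursive lock
would hang). The put_inside_lock flag selects between dropping the
namespace reference under the lock and dropping it after unlocking.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model only: every name here is hypothetical, not a kernel API. */
static int toy_lock_held;   /* stands in for the lock taken in free_user */
static int toy_reentries;   /* times the lock was taken while already held */

static void toy_lock(void)
{
	if (toy_lock_held)
		toy_reentries++;   /* a real non-recursive lock would hang here */
	toy_lock_held++;
}

static void toy_unlock(void)
{
	assert(toy_lock_held > 0);
	toy_lock_held--;
}

struct toy_user;
struct toy_ns   { int refs; struct toy_user *creator; };
struct toy_user { int refs; struct toy_ns *ns; };

static void toy_free_user(struct toy_user *u, int put_inside_lock);

/* Dropping the last namespace reference also releases its creator. */
static void toy_put_ns(struct toy_ns *ns, int put_inside_lock)
{
	struct toy_user *creator;

	if (!ns || --ns->refs > 0)
		return;
	creator = ns->creator;
	free(ns);
	if (creator)
		toy_free_user(creator, put_inside_lock);
}

static void toy_free_user(struct toy_user *u, int put_inside_lock)
{
	struct toy_ns *ns;

	if (--u->refs > 0)
		return;
	toy_lock();
	ns = u->ns;
	free(u);                               /* stands in for unhash + free */
	if (put_inside_lock)
		toy_put_ns(ns, put_inside_lock);   /* recurses with the lock held */
	toy_unlock();
	if (!put_inside_lock)
		toy_put_ns(ns, put_inside_lock);   /* lock already dropped */
}

static struct toy_ns *toy_new_ns(struct toy_user *creator)
{
	struct toy_ns *ns = calloc(1, sizeof(*ns));

	if (!ns)
		abort();
	ns->creator = creator;
	if (creator)
		creator->refs++;                   /* the namespace pins its creator */
	return ns;
}

static struct toy_user *toy_new_user(struct toy_ns *ns)
{
	struct toy_user *u = calloc(1, sizeof(*u));

	if (!u)
		abort();
	u->refs = 1;                           /* reference held by the task */
	u->ns = ns;
	ns->refs++;                            /* each user pins its namespace */
	return u;
}

/* Builds creator -> inner namespace -> root user, tears it down, and
 * returns how often the lock was taken while already held. */
int toy_count_reentries(int put_inside_lock)
{
	struct toy_ns *outer = toy_new_ns(NULL);
	struct toy_user *creator = toy_new_user(outer);
	struct toy_ns *inner = toy_new_ns(creator);
	struct toy_user *root = toy_new_user(inner);

	toy_lock_held = 0;
	toy_reentries = 0;
	toy_free_user(creator, put_inside_lock);   /* still pinned by inner */
	toy_free_user(root, put_inside_lock);      /* unwinds the whole chain */
	return toy_reentries;
}
```

Calling toy_count_reentries(1) models the ordering the changelog entry
moves away from, and toy_count_reentries(0) models the ordering it moves
to; only the former ever takes the toy lock while it is already held.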
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cfb0d87..9bebf95 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -601,6 +601,7 @@ struct user_struct {
/* Hash table maintenance information */
struct hlist_node uidhash_node;
uid_t uid;
+ struct user_namespace *user_ns;
#ifdef CONFIG_USER_SCHED
struct task_group *tg;
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index b5f41d4..f9477c3 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -13,6 +13,7 @@ struct user_namespace {
struct kref kref;
struct hlist_head uidhash_table[UIDHASH_SZ];
struct user_struct *root_user;
+ struct user_struct *creator;
};
extern struct user_namespace init_user_ns;
diff --git ...

Patch 2:

When we get the sysfs support needed to support fair user scheduling
along with user namespaces, then we will need to be able to get the
user namespace from the user struct. So we need the user_ns to be a
part of struct user. Once we can access it from tsk->user, we no
longer have a use for tsk->nsproxy->user_ns.

When a user_namespace is created, the user which created it is marked
as its 'creator'. The user_namespace pins the creator. Each userid in
a user_ns pins the user_ns. This keeps refcounting nice and simple.

At the same time, this patch simplifies the refcounting. The current
user and userns locking works as follows:

	The task pins the user struct.
	The task pins the nsproxy, the nsproxy pins the user_ns.
	When a new user_ns is created, it creates a root user for it,
	and pins it.
	When the nsproxy releases the user_ns, the userns tries to
	release all its user structs.

So you see that the refcounting "works" for now, but only because the
nsproxy (and therefore user_ns) and user structs will be freed at the
same time - when the last task using them is released.

Now we need to put the user_ns in the struct user. You can see that
will mess up the refcounting. Fortunately, once the user_ns is
available from tsk->user, we don't need it in nsproxy. So here is how
the refcounting *should* be done:

	The task pins the user struct.
	The user struct pins its user namespace.
	The user namespace pins the struct user which created it.

A user namespace now doesn't need to release its userids, because it
is only released when its last user disappears.

This patch makes those changes.

Signed-off-by: Serge Hallyn <serue@us.ibm.com>
---
include/linux/init_task.h | 1 -
include/linux/key.h | 3 ++
include/linux/nsproxy.h | 1 -
include/linux/sched.h | 1 +
include/linux/user_namespace.h | 10 +++-----
kernel/fork.c | 3 +-
kernel/nsproxy.c | 10 ...
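
The three pinning rules that patch 2 describes (task pins user, user
pins its namespace, namespace pins its creator) can be sketched as a
small userspace model. This is an assumption-laden illustration, not the
kernel implementation: model_user, model_user_ns and the helper
functions are hypothetical names, plain ints stand in for kref and
atomic counters, and there is no locking. A live_objects counter makes
it possible to check that nothing is freed early and nothing leaks.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model only: these are NOT the kernel structures. */
struct model_user;

struct model_user_ns {
	int refs;                     /* one per user in the namespace */
	struct model_user *creator;   /* pinned by the namespace, NULL for the first ns */
};

struct model_user {
	int refs;                     /* one per task, plus one per namespace created */
	struct model_user_ns *ns;     /* pinned by the user */
};

static int live_objects;          /* allocations not yet freed */

static void model_put_user(struct model_user *u);

/* Returns a namespace holding one reference owned by the caller. */
static struct model_user_ns *model_new_ns(struct model_user *creator)
{
	struct model_user_ns *ns = calloc(1, sizeof(*ns));

	if (!ns)
		abort();
	ns->refs = 1;
	ns->creator = creator;
	if (creator)
		creator->refs++;          /* the namespace pins its creator */
	live_objects++;
	return ns;
}

static void model_put_ns(struct model_user_ns *ns)
{
	struct model_user *creator;

	if (--ns->refs > 0)
		return;
	creator = ns->creator;
	free(ns);
	live_objects--;
	if (creator)
		model_put_user(creator);  /* last user gone: unpin the creator */
}

/* Returns a user holding one reference owned by the calling "task". */
static struct model_user *model_new_user(struct model_user_ns *ns)
{
	struct model_user *u = calloc(1, sizeof(*u));

	if (!u)
		abort();
	u->refs = 1;
	u->ns = ns;
	ns->refs++;                   /* the user pins its namespace */
	live_objects++;
	return u;
}

static void model_put_user(struct model_user *u)
{
	struct model_user_ns *ns;

	if (--u->refs > 0)
		return;
	ns = u->ns;
	free(u);
	live_objects--;
	model_put_ns(ns);
}

/* Walks one create/teardown cycle; returns objects still alive at the end. */
int model_demo(void)
{
	struct model_user_ns *first = model_new_ns(NULL);
	struct model_user *creator = model_new_user(first);
	struct model_user_ns *child;
	struct model_user *root;

	model_put_ns(first);          /* only 'creator' pins 'first' now */

	child = model_new_ns(creator);
	root = model_new_user(child);
	model_put_ns(child);          /* only 'root' pins 'child' now */

	model_put_user(creator);      /* task switched away from the creator */
	assert(live_objects == 4);    /* creator survives: 'child' pins it */

	model_put_user(root);         /* last task exits: the chain unwinds */
	return live_objects;
}
```

In this model the namespace never has to walk and release its users: it
goes away only when the last user in it has dropped its reference, which
is the property the description above relies on.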
Patch 3:

Currently, creating a new user namespace does not reset the task's uid
or gid. Since generally that is done as root because it requires
CAP_SYS_ADMIN, and since the first uid in the new namespace is 0, one
usually doesn't notice. However, if one does

	capset cap_sys_admin=ep ns_exec
	su - hallyn
	ns_exec -U /bin/sh
	id

then one will see hallyn's userid, and all preexisting groups.

With this patch, cloning a new user namespace will set the task's uid
and gid to 0, and reset the group_info to the empty set assigned to
init.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
kernel/user_namespace.c | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index d59f193..16e6296 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -10,6 +10,9 @@
 #include <linux/slab.h>
 #include <linux/user_namespace.h>

+/* defined in kernel/sys.c */
+extern struct group_info init_groups;
+
 /*
  * Clone a new ns copying an original user ns, setting refcount to 1
  * @old_ns: namespace to clone
@@ -47,6 +50,17 @@ int create_new_userns(int flags, struct task_struct *tsk)
 	put_user_ns(ns);

 	task_switch_uid(tsk, ns->root_user);
+	tsk->uid = tsk->euid = tsk->suid = tsk->fsuid = 0;
+	tsk->gid = tsk->egid = tsk->sgid = tsk->fsgid = 0;
+
+	/* this can't be safe for unshare, can it? it's safe
+	 * for fork, though. I'm tempted to limit clone_newuser to
+	 * fork only */
+	task_lock(tsk);
+	put_group_info(tsk->group_info);
+	tsk->group_info = &init_groups;
+	get_group_info(tsk->group_info);
+	task_unlock(tsk);

 	return 0;
 }
-- 
1.5.4.3
--
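
To observe what patch 3 changes from inside the new namespace, something
like the following userspace helper could stand in for the `id` step of
the reproduction above. It is only a sketch: report_ids is a
hypothetical name, and it uses nothing beyond the standard getuid(),
geteuid(), getgid(), getegid() and getgroups() calls. Going by the patch
description, a shell started in a freshly cloned user namespace would be
expected to report zero ids and an empty supplementary group list once
the patch is applied, and the invoking user's ids and groups without it.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Prints the real and effective uid/gid and the supplementary groups
 * of the calling process. Returns the number of supplementary groups,
 * or -1 on error. (Hypothetical helper, not part of the patch.)
 */
int report_ids(FILE *out)
{
	int n = getgroups(0, NULL);   /* size 0: only ask for the count */
	gid_t *groups;
	int i;

	if (n < 0)
		return -1;

	/* Allocate at least one slot so the pointer is always valid. */
	groups = malloc((size_t)(n > 0 ? n : 1) * sizeof(*groups));
	if (!groups)
		return -1;

	n = getgroups(n, groups);
	if (n < 0) {
		free(groups);
		return -1;
	}

	fprintf(out, "uid=%ld euid=%ld gid=%ld egid=%ld groups=",
		(long)getuid(), (long)geteuid(),
		(long)getgid(), (long)getegid());
	for (i = 0; i < n; i++)
		fprintf(out, "%s%ld", i ? "," : "", (long)groups[i]);
	fputc('\n', out);

	free(groups);
	return n;
}
```

A caller would typically just invoke report_ids(stdout) before and
after entering the new namespace and compare the two lines.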
On Tue, 26 Aug 2008 13:53:41 -0500

The credentials code in linux-next is changing the same code which
you're changing, in more-than-trivially-textual ways.

I'd suggest a dhowells cc on these changes, as he's also working in
this area, and as you touch the keyring code a bit.

And, of course, please remove that almost-always-wrong
extern-declaration-in-C which checkpatch told you about. init_groups
is already declared in include/linux/init_task.h anyway...
--
Sorry, will do with the rebase.

Thanks, Andrew.

-serge
--
