Every now and then, someone wants to let unprivileged programs change something about their execution environment (think unsharing namespaces, changing capabilities, disabling networking, chrooting, mounting and unmounting filesystems). Whether or not any of these abilities are good ideas, there's a recurring problem that gets most of these patches shot down: setuid executables.

The obvious solution is to allow a process to opt out of setuid semantics and require processes to do this before using these shiny new features. [1] [2] But there's a problem with this, too: with LSMs running, execve can do pretty much anything, and even unprivileged users running unprivileged programs can have crazy security implications. (Take a look at a default install of Fedora. If you can understand the security implications of disabling setuid, you get a cookie. If you can figure out which programs will result in a change of security label when exec'd, you get another cookie.)

So here's another solution, based on the idea that in a sane world, execve should be a lot less magical than it is. Any unprivileged program can open an executable, parse its headers, map it, and run it, although getting all the details right is tedious at best (and there's no good way to get all of the threading semantics right from userspace).

Patch 1 adds a new syscall execve_nosecurity. It does an exec, but without changing any security properties. This means no setuid, no setgid, no LSM credential hooks (e.g. no SELinux type transitions), and no ptrace restrictions. (You have to have read access to the program, because disabling security stuff could allow someone to ptrace a program that they couldn't otherwise ptrace.) This shouldn't be particularly scary -- any process could do much the same thing with open and mmap.

(You can easily shoot yourself in the foot with this syscall -- think LD_PRELOAD or running some program with insufficient error checking that can get subverted when run in the wrong ...
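As a rough illustration of the "open an executable, parse its headers" remark and of the read-access requirement, here is a minimal userspace sketch. The helper name looks_like_readable_elf is hypothetical and is not part of the patch series; it only performs the first two steps (open for reading, check the ELF magic) and does not map or run anything.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns  1 if `path` can be opened for reading and starts with the ELF magic,
 *          0 if it is readable but does not start with the ELF magic,
 *         -1 if it cannot be opened or read (e.g. no read permission).
 */
static int looks_like_readable_elf(const char *path)
{
    static const unsigned char elf_magic[4] = { 0x7f, 'E', 'L', 'F' };
    unsigned char ident[4];
    ssize_t n;
    int fd;

    assert(path != NULL);

    /* Read access is the only permission a userspace loader needs. */
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    n = read(fd, ident, sizeof(ident));
    close(fd);

    if (n < 0)
        return -1;
    if (n != (ssize_t)sizeof(ident))
        return 0;   /* too short to be an ELF file */

    return memcmp(ident, elf_magic, sizeof(elf_magic)) == 0;
}
```

A real loader would go on to parse the program headers and mmap the segments; the point here is only that nothing beyond read permission is involved.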
This adds a prctl PR_RESTRICT_ME that enables restrictions that cannot be
disabled and are inherited by children. There's a long history of dangerous
patches that add similar restrictions that persist across execve. This is
bad: execve can grant new privileges, and restrictions on exec'd programs
can be used to subvert them.
To avoid this issue, the very first PR_RESTRICT_ME restriction bit is
PR_RESTRICT_EXEC, which simply disables exec.
In the presence of execve_nosecurity, this can be used to shoot oneself in
the foot, but it should not be possible to shoot other people in the foot
with this patch.
Any future PR_RESTRICT_ME bits should not be allowed to be set unless
PR_RESTRICT_EXEC is also set.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
fs/compat.c | 5 +++++
fs/exec.c | 5 +++++
include/linux/prctl.h | 6 ++++++
include/linux/sched.h | 2 ++
kernel/fork.c | 2 ++
kernel/sys.c | 29 +++++++++++++++++++++++++++++
6 files changed, 49 insertions(+), 0 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index 585a2d7..a091da6 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1468,6 +1468,11 @@ int compat_do_execve(char * filename,
bool clear_in_exec;
int retval;
+ if (current->restrict_exec && change_security) {
+ retval = -EPERM;
+ goto out_ret;
+ }
+
retval = unshare_files(&displaced);
if (retval)
goto out_ret;
diff --git a/fs/exec.c b/fs/exec.c
index 4067b65..37fb5fa 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1350,6 +1350,11 @@ int do_execve(char * filename,
bool clear_in_exec;
int retval;
+ if (current->restrict_exec && change_security) {
+ retval = -EPERM;
+ goto out_ret;
+ }
+
retval = unshare_files(&displaced);
if (retval)
goto out_ret;
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..b926055 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -102,4 +102,10 @@
#define PR_MCE_KILL_GET 34
+/* ...
This flag is preserved across execve_nosecurity. It's obviously dangerous, so
we only allow it if PR_RESTRICT_EXEC is set.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
fs/compat.c | 3 +++
fs/exec.c | 3 +++
include/linux/prctl.h | 5 +++++
include/linux/sched.h | 1 +
kernel/fork.c | 1 +
kernel/sys.c | 13 +++++++++++++
6 files changed, 26 insertions(+), 0 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index a091da6..4b7f61f 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1468,6 +1468,9 @@ int compat_do_execve(char * filename,
bool clear_in_exec;
int retval;
+ if (current->force_execve_nosecurity)
+ change_security = false;
+
if (current->restrict_exec && change_security) {
retval = -EPERM;
goto out_ret;
diff --git a/fs/exec.c b/fs/exec.c
index 37fb5fa..0e045b8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1350,6 +1350,9 @@ int do_execve(char * filename,
bool clear_in_exec;
int retval;
+ if (current->force_execve_nosecurity)
+ change_security = false;
+
if (current->restrict_exec && change_security) {
retval = -EPERM;
goto out_ret;
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index b926055..8465df3 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -108,4 +108,9 @@
#define PR_GET_RESTRICT 36
+/* Get/set execve -> execve_nosecurity remapping. */
+#define PR_SET_FORCE_EXECVE_NOSECURITY 37
+#define PR_GET_FORCE_EXECVE_NOSECURITY 38
+
+
#endif /* _LINUX_PRCTL_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d1956f7..59f7bcd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1303,6 +1303,7 @@ struct task_struct {
unsigned sched_reset_on_fork:1;
unsigned restrict_exec:1; /* Process may not call execve. */
+ unsigned force_execve_nosecurity:1; /* execve means execve_nosecurity */
pid_t pid;
pid_t tgid;
diff --git a/kernel/fork.c b/kernel/fork.c
index 8f994e5..d7e1688 ...

No responses for a month after this was sent. Really, thanks, I do appreciate the work at another approach.

I'll be honest, I prefer option [1]. Though I think it's reasonable to require privilege for prctl(PR_SET_NOSUID). Make it a separate capability, and on most systems it should be safe to have a file sitting in /bin with cap_set_nosuid+pe. If OTOH you know you have legacy or poorly coded privileged programs which would not be safe because they don't verify that they have the needed privs, you just don't provide the program to do prctl(PR_SET_NOSUID) for unprivileged users.

(I did like using new securebits as in [2], but I prefer the automatic not-raising-privs of [1] to simply -EPERM on uid/gid change and lack of checking for privs raising of [2].)

Really the trick will be finding a balance to satisfy those wanting this as a separate LSM, without traipsing into LSM stacking territory. I myself think this feature fits very nicely with established semantics, but not everyone agrees, so chances are my view is a bit tainted, and we should defer to those wanting this to be an LSM.

Of course, another alternative is to skip this feature altogether and push toward targeted capabilities. The problem is that path amounts to playing whack-a-mole to catch all the places where privilege might leak to a parent namespace, whereas [1] simply, cleanly cuts them all off at the source.

thanks, -serge --
Both approaches result in two kinds of exec: the normal kind that respects setuid, file capabilities, and LSMs, and the restricted kind that is supposed to be safe when programs have unshared namespaces and other crazy things.

Eric's approach [1] adds a restricted kind of exec that ignores setuid but still (AFAICT) respects file capabilities and LSM transitions. I think this is a terrible idea for two reasons:

1. LSM transitions already scare me enough, and if anyone relies on them working in concert with setuid, then the mere act of separating them might break things, even if the "privileged" (by LSM) app in question is well-written.

2. File capabilities are just as dangerous as setuid, and I wouldn't even know how to write a program that's safe when it has extra capabilities granted by fE (or fP or whatever it is) and the caller has, say, an unshared fs namespace and the ability to rearrange the namespace arbitrarily.

In short, I think that this nosuid exec is both dangerous in and of itself *and* doesn't actually solve the problem it was supposed to solve.

I also don't like relying on the admin to decide that it's safe to allow PR_SET_NOSUID (or whatever you call it) and having to install a special privileged program to enable it. If sandbox-like features require explicit action by root, then they won't be as widely used as they should be. And how many admins will have any clue whether enabling this feature is safe?

My approach introduces what I think is a much more obviously safe restricted exec, and I think it's so safe that no privilege or special configuration should be required to use it.

As for what to call it (execve_nosecurity or PR_SET_NOSUID) or whether to have a special syscall so that programs that aren't restricted can use the restricted exec, I don't care all that much. I just think that the separate syscall might be useful in its own right and I think that making this an LSM is absurd. Containers (and anything else people want to do ...
hmm...

Absolutely these should not be ignored, and Eric didn't mean to ignore

I do not agree with deciding the admins are not competent to admin their system and therefore we should bypass them and let users decide. But it's moot, as I think you've convinced me with your point 1. above

Yes, but that's a reason to aim for targeted caps. Exec_nopriv or

Not sure what you mean by that last part - inside the sandbox, you won't get capabilities, targeted or otherwise, but certainly targeted capabilities and a sandbox are not mutually exclusive.

Thanks for responding, I'll take another look at your patchset in detail.

thanks, -serge --
Is a targeted cap something like "process A can call setdomainname,

Agreed. What I want is a syscall that says "make me a sandbox" and then for that program to be able to intercept and modify most (all?) syscalls issued from inside the sandbox. But programs in the sandbox probably need to call exec, and if the sandbox's owner can muck around with exec'd programs, then exec had better have no security effect. Hence a need for some kind of restricted exec.

The sandbox owner would then make up own targeted capabilities if needed. But yes, targeted capabilities for kernel containers are probably

Thanks! --Andy --
Right, only to the UTS ns in which you live. See for instance http://thread.gmane.org/gmane.linux.kernel.containers/15934 . It's how we express for instance that root in a child user_namespace has CAP_DAC_OVERRIDE over files in the container, but not over the host. -serge --
At least in the case of SELinux, context transitions upon execve are already disabled in the nosuid case, and Eric's patch updated the SELinux test accordingly. -- Stephen Smalley National Security Agency --
True, but I think it's still asking for trouble -- other LSMs could (and almost certainly will, especially the out-of-tree ones) do something, and I think that any action at all that an LSM takes in the bprm_set_creds hook for a nosuid (or whatever it's called) process is wrong or at best misguided. Can you think of anything that an LSM should do (or even should be able to do) when a nosuid process calls exec, other than denying the request outright? With my patch, LSMs can still reject the open_exec call. --Andy --
I could be wrong, but I think the point is that your reasoning is correct, and that the same reasoning must apply if we're just --
I tend to agree, except that only root can set nosuid (presumably) and making that change will change existing behavior. Is that a problem? --Andy --
I think Stephen has just convinced me that MNT_NOSUID will never make sense -- there's odd legacy behavior in there and we'll probably never get anyone to change it. So if we give up on changing nosuid, there are a couple of things we might want to do:

1. A mode where execve acts like all filesystems are MNT_NOSUID. This sounds like a bad idea (if nothing else, it will cause apps that use selinux's exec_sid mechanism (runcon?) to silently malfunction).

2. A mode where execve (or a new syscall?) has no effect on credentials at all. This is conceptually simple and it would be great for new userspace code, especially code that wants to do something sandbox-like. For simplicity, even things like the effective and inherited capability sets should probably remain unchanged. In this mode, we'll have to disallow execing unreadable files. securebits are (almost) irrelevant. This is what my patch does.

Dealing with AT_SECURE will be awkward at best, so programs that enter this mode should sanitize their own environments and should be very careful if they were setuid. (But they should do that anyway.)

There are a couple of annoyances to deal with. First, there are LSM API issues, like this code in SELinux:

    new_tsec->osid = old_tsec->sid;

    /* Reset fs, key, and sock SIDs on execve. */
    new_tsec->create_sid = 0;
    new_tsec->keycreate_sid = 0;
    new_tsec->sockcreate_sid = 0;

and this code in commoncap:

    new->suid = new->fsuid = new->euid;
    new->sgid = new->fsgid = new->egid;

I have no problem keeping these.

The other annoyance is cap_effective. We could clear it on every exec (what commoncap does for non-legacy executables, I think), but that would completely break any legacy code running as root. We could set it to cap_permitted on every exec, which sounds like bad engineering even though I don't see any specific problem with it. We could also just leave it alone across exec, which might have odd side effects for programs which change their effective set ...
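The three choices for cap_effective discussed in this message (clear it, set it to cap_permitted, or leave it alone) can be compared side by side in a toy model. Everything below is an assumption made for illustration: struct toy_caps, enum pe_policy and toy_exec_caps are hypothetical names, capability sets are plain 64-bit masks, and none of this is kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the capability fields of a credential (not struct cred). */
struct toy_caps {
    uint64_t permitted;   /* pP */
    uint64_t effective;   /* pE */
};

/* The three candidate treatments of pE across a credential-preserving exec. */
enum pe_policy {
    PE_CLEAR,             /* clear pE on every exec */
    PE_SET_TO_PERMITTED,  /* set pE = pP on every exec */
    PE_LEAVE_ALONE        /* keep pE unchanged across exec */
};

/* Returns the capability sets after exec under the chosen policy. */
static struct toy_caps toy_exec_caps(struct toy_caps old, enum pe_policy policy)
{
    struct toy_caps next = old;   /* pP is never touched in this mode */

    /* Precondition of the model: pE is a subset of pP. */
    assert((old.effective & ~old.permitted) == 0);

    switch (policy) {
    case PE_CLEAR:
        next.effective = 0;
        break;
    case PE_SET_TO_PERMITTED:
        next.effective = old.permitted;
        break;
    case PE_LEAVE_ALONE:
        break;
    }
    return next;
}
```

In this model, a root-like task with a full permitted set loses all effective capabilities under PE_CLEAR, which corresponds to the "would completely break any legacy code running as root" concern.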
I think at this point we've lost track of exactly what we're trying to do. The goal, at least for myself and (I think) Eric, was to prevent certain changes in environment, initiated by an unprivileged user, from confusing setuid-root programs (initiated by the user). A concrete example was the proposed disablenet feature, with which an unprivileged task can remove its ability to open any new network connections. With that in mind, I think option 1 is actually the best option. I especially hate option 2 because of the resulting temptation to fudge with pE :) If you're going to fudge with pE, then IMO it MUST be done in a new securebits mode. Now actually, re-reading my msg, given our original goal, I dare say that Andrew Morgan's approach of simply returning -EPERM for any app which tries to setuid or change privileges on exec just might be the sanest way, at least to start with. -serge --
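To make the "-EPERM for any app which tries to setuid or change privileges on exec" idea concrete, here is a deliberately simplified sketch. It is not Andrew Morgan's patch: struct toy_file and toy_exec_or_eperm are hypothetical names, and only uid/gid changes via setuid/setgid bits are modeled (file capabilities and LSM transitions are ignored).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy description of the file being exec'd (illustration only). */
struct toy_file {
    bool setuid;          /* S_ISUID set on the file */
    bool setgid;          /* S_ISGID set on the file */
    unsigned owner_uid;
    unsigned owner_gid;
};

/*
 * Returns 0 if exec would leave the caller's euid/egid unchanged,
 * -EPERM if the file would change either of them.
 */
static int toy_exec_or_eperm(unsigned euid, unsigned egid,
                             const struct toy_file *f)
{
    assert(f != NULL);

    if (f->setuid && f->owner_uid != euid)
        return -EPERM;    /* exec would change the effective uid */
    if (f->setgid && f->owner_gid != egid)
        return -EPERM;    /* exec would change the effective gid */
    return 0;
}
```

The contrast with a nosuid-style mode is that the exec fails outright here, instead of proceeding with the setuid bit silently ignored.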
I think the show-stopper for number 1 is the fact that nosuid has really strange semantics, and I'm a bit scared of making them more widespread. For example, selinux-aware apps can request a type change on exec, and nosuid causes that request to be silently ignored. This could silently break otherwise-working selinux sandboxes. Stephen I'll fight that fight later. (I wish the original rule had been pE' = Fair enough. It'll annoy some selinux users, but maybe the selinux people will figure out how to fix it when enough users complain.

I'll hack up and submit a patch series to add PR_EXEC_DISALLOW_PRIVS and allow CLONE_NEWNET when it's set. Then I'll argue with Alan Cox for a week or three, I suppose :)

I think I'll arrange it so that PR_EXEC_DISALLOW_PRIVS & uid==0 && (pP != all) && !SECURE_ROOT will cause execve to always fail. nonroot && pP != 0 && !KEEPCAPS will fail as well, since it seems silly to add a special case (if you're nonroot and create an unprivileged container, drop the caps yourself).

--Andy

(My system has a setuid binary that does unshare(CLONE_NEWIPC), drops privs and execs its argument. I'll be happy to get rid of it.) --
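The two failure conditions proposed in this message can be written out as a predicate. This is a sketch under stated assumptions: struct toy_task, TOY_CAP_ALL and toy_exec_check are hypothetical names, the "all" capability set is modeled as an all-ones 64-bit mask, the secure_root field mirrors the "SECURE_ROOT" bit exactly as the message spells it, and the nonroot rule is assumed to apply only when PR_EXEC_DISALLOW_PRIVS is set.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: every bit set stands for "pP == all". */
#define TOY_CAP_ALL UINT64_MAX

/* Toy view of the task state the proposed rule looks at (not task_struct). */
struct toy_task {
    bool disallow_privs;  /* PR_EXEC_DISALLOW_PRIVS has been set */
    unsigned uid;
    uint64_t permitted;   /* pP */
    bool secure_root;     /* "SECURE_ROOT", as written in the message */
    bool keepcaps;        /* KEEPCAPS */
};

/* Returns 0 if execve may proceed under the proposed rule, -EPERM otherwise. */
static int toy_exec_check(const struct toy_task *t)
{
    assert(t != NULL);

    if (!t->disallow_privs)
        return 0;         /* the rule only applies once the flag is set */

    /* uid==0 && (pP != all) && !SECURE_ROOT: execve always fails. */
    if (t->uid == 0 && t->permitted != TOY_CAP_ALL && !t->secure_root)
        return -EPERM;

    /* nonroot && pP != 0 && !KEEPCAPS: fails as well. */
    if (t->uid != 0 && t->permitted != 0 && !t->keepcaps)
        return -EPERM;

    return 0;
}
```

Both rejected cases are ones where a normal exec would otherwise recompute the permitted set, so failing is simpler than defining a special case.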
In the case where the context transition would shed permissions rather than gain permissions, it has been suggested that we shouldn't disable the transition even in the presence of nosuid. But automatically computing that for a domain transition is non-trivial, so we have the present behavior for SELinux. There also can be state updates even in the non-suid exec case, e.g. saved uids, clearing capabilities, etc. But as far as the access control goes, it should suffice to check read and execute access to the file, just as with the userland ELF loader scenario (which gets handled by the mmap hook). -- Stephen Smalley National Security Agency --
Ah, right. In my patch, execve_nosecurity is (or will be, anyway) documented to skip all of this, and it's a new syscall, so nothing should need to be done. It doesn't allow anything that a userland ELF loader couldn't already do. (I'm not thrilled with changing the behavior of the original execve syscall, but one way or another, any nosuid mechanism will probably allow programs to exec other things without losing permissions that the admin might have expected. I don't see this as a real problem, though.)

Is it even possible to purely drop permissions in SELinux? If your original type was orig_t and your new type is new_t, and if the rights granted to orig_t and new_t overlap nontrivially, then what are you supposed to do? Check both types for each hook? (Some annoying admin could even *change* the rights for orig_t or new_t after execve --
The further you deviate from existing execve semantics, the less likely your solution will work cleanly as a transparent replacement for execve for userland running in this nosuid state, and the less compelling the case for implementing execve_nosecurity in the kernel vs. just userspace

It has always been possible to configure policy such that one type is less privileged than its caller, and the typebounds construct introduced in more recent SELinux provides a kernel-enforced mechanism for ensuring that one type is strictly bounded by the permissions of another type.

-- Stephen Smalley National Security Agency --
I don't see that code in current -linus, nor do I see where SELinux affects dumpability. What's supposed to happen? I'm writing a patch right now to clean this stuff up. --Andy --
check out security/selinux/hooks.c:selinux_bprm_set_creds()

    if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
        new_tsec->sid = old_tsec->sid;

I assume that's it?

-serge --
