If a user manages to trigger a kernel BUG() or page fault with fs set to
KERNEL_DS, fs is not otherwise reset before do_exit(), allowing the user to
write a 0 to an arbitrary address in kernel memory.
Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
---
AFAICT this is presently only triggerable in the presence of another bug, but
this potentially turns a lot of DoS bugs into privilege escalation, so it's
worth fixing. Among other things, sock_no_sendpage and the kernel_{read,write}v
calls in splice.c make it easy to call an awful lot of the kernel under
KERNEL_DS.
This isn't the only way we could fix this -- we could put the set_fs() at the
start of do_exit, or in all the callers that might call potentially do_exit with
KERNEL_DS set, or else we could do an access_ok inside fork(). I'm happy to put
together one of those patches if someone thinks another approach makes more
sense.
kernel/fork.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..a68445e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -636,7 +636,12 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
/*
* We don't check the error code - if userspace has
* not set up a proper pointer then tough luck.
+ *
+ * We do set_fs() explicitly in case this task
+ * exited while inside set_fs(KERNEL_DS) for
+ * some reason (e.g. on a BUG()).
*/
+ set_fs(USER_DS);
put_user(0, tsk->clear_child_tid);
sys_futex(tsk->clear_child_tid, FUTEX_WAKE,
1, NULL, NULL, 0);
--
1.7.1.31.g6297e
--
On Mon, 29 Nov 2010 21:19:16 -0500 Confused. The user can only exploit the wrong addr_limit if control returns to userspace for the user's code to execute. But that won't be happening, because this thread will unconditionally exit. If/when you unconfuse me, I'd suggest this change only be done if the thread is *known* to have oopsed - doing it for non-oopsed threads seems unpleasant to my mind. And I think it should be done nice and clearly, right up inside do_exit() by some means. Or perhaps in the oops code, just before it calls do_exit(). Not hidden down in mm_release(). --
The user can exploit the wrong addr_limit on the very next line, with the put_user() there. clear_child_tid is not checked in any way before this point. Writing a single zero might not seem like much, but it's enough for privilege escalation (e.g. overwrite the top half of a function pointer to point to userspace). I have a PoC code that uses this bug, along with CVE-2010-3849, to write a zero to an arbitrary kernel address, so I've tested that this is not theoretical. That's also why I put the set_fs() hidden inside mm_release, since that's the only place where (to my knowledge) it matters. On re-reading, I didn't mention clear_child_tid anywhere in the commit message, which was an error on my part, and explains the confusion. Sorry about that, and I hope this clears that up. Let me know if this makes more sense, and I'll send a revised patch. --
I do think it would be better to fix up the addr_limit somewhere within the oops code rather than in the regular code. Presumably just before calling do_exit(). Isn't that the logical place? Plus it fixes up any other such problems, whether they be there now or in the future. Although that involves altering every arch, each in multiple places. ick. Maybe at the start of do_exit()? --
If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not otherwise reset before do_exit(). do_exit may later (via mm_release in fork.c) do a put_user to a user-controlled address, potentially allowing a user to leverage an oops into a controlled write into kernel memory. A more logical place to put this might be when we know an oops has occurred, before we call do_exit(), but that would involve changing every architecture, in multiple places. Let's just stick it in do_exit instead. Signed-off-by: Nelson Elhage <nelhage@ksplice.com> --- kernel/exit.c | 8 ++++++++ 1 files changed, 8 insertions(+), 0 deletions(-) diff --git a/kernel/exit.c b/kernel/exit.c index 21aa7b3..68899b3 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -914,6 +914,14 @@ NORET_TYPE void do_exit(long code) if (unlikely(!tsk->pid)) panic("Attempted to kill the idle task!"); + /* + * If do_exit is called because this processes oopsed, it's possible + * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before + * continuing. This is relevant at least for clearing clear_child_tid in + * mm_release. + */ + set_fs(USER_DS); + tracehook_report_exit(&code); validate_creds_for_do_exit(tsk); -- 1.7.1.31.g6297e --
"This is relevant" is no good explanation ;) Please recognize this is tricky code and Please consider to write more careful and looooong comments. Thanks. --
On Wed, 1 Dec 2010 11:50:32 +0900 (JST) I've seen worse comments. And occasionally none at all :) Is this better? --- a/kernel/exit.c~do_exit-make-sure-we-run-with-get_fs-==-user_ds-fix +++ a/kernel/exit.c @@ -917,8 +917,9 @@ NORET_TYPE void do_exit(long code) /* * If do_exit is called because this processes oopsed, it's possible * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before - * continuing. This is relevant at least for clearing clear_child_tid in - * mm_release. + * continuing. Amongst other possible reasons, this is to prevent + * mm_release()->clear_child_tid() from writing to a user-controlled + * kernel address. */ set_fs(USER_DS); _ --
On Tue, 30 Nov 2010 21:27:36 -0500 I think that the potential of escalating an oops or a BUG into a local root hole is pretty serious so I'll send this fix along for 2.6.37 and I tagged it for -stable backporting, along with a sterner-sounding changelog. From: Nelson Elhage <nelhage@ksplice.com> If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not otherwise reset before do_exit(). do_exit may later (via mm_release in fork.c) do a put_user to a user-controlled address, potentially allowing a user to leverage an oops into a controlled write into kernel memory. This is only triggerable in the presence of another bug, but this potentially turns a lot of DoS bugs into privilege escalations, so it's worth fixing. I have proof-of-concept code which uses this bug along with CVE-2010-3849 to write a zero to an arbitrary kernel address, so I've tested that this is not theoretical. A more logical place to put this fix might be when we know an oops has occurred, before we call do_exit(), but that would involve changing every architecture, in multiple places. Let's just stick it in do_exit instead. [akpm@linux-foundation.org: update code comment] Signed-off-by: Nelson Elhage <nelhage@ksplice.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- kernel/exit.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff -puN kernel/exit.c~do_exit-make-sure-we-run-with-get_fs-==-user_ds kernel/exit.c --- a/kernel/exit.c~do_exit-make-sure-we-run-with-get_fs-==-user_ds +++ a/kernel/exit.c @@ -914,6 +914,15 @@ NORET_TYPE void do_exit(long code) if (unlikely(!tsk->pid)) panic("Attempted to kill the idle task!"); + /* + * If do_exit is called because this processes oopsed, it's possible + * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before + * continuing. Amongst other possible reasons, this is to prevent + * ...
