Re: [PATCH] mm_release: Do a set_fs(USER_DS) before handling clear_child_tid.

Previous thread: linux-next: Tree for November 30 by Stephen Rothwell on Monday, November 29, 2010 - 7:07 pm. (1 message)

Next thread: [PATCH -v2 0/3] Report APEI GHES error information via printk by Huang Ying on Monday, November 29, 2010 - 7:51 pm. (17 messages)
From: Nelson Elhage
Date: Monday, November 29, 2010 - 7:19 pm

If a user manages to trigger a kernel BUG() or page fault with fs set to
KERNEL_DS, fs is not otherwise reset before do_exit(), allowing the user to
write a 0 to an arbitrary address in kernel memory.

Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
---
AFAICT this is presently only triggerable in the presence of another bug, but
this potentially turns a lot of DoS bugs into privilege escalation, so it's
worth fixing. Among other things, sock_no_sendpage and the kernel_{read,write}v
calls in splice.c make it easy to call an awful lot of the kernel under
KERNEL_DS.

This isn't the only way we could fix this -- we could put the set_fs() at the
start of do_exit, or in all the callers that might call potentially do_exit with
KERNEL_DS set, or else we could do an access_ok inside fork(). I'm happy to put
together one of those patches if someone thinks another approach makes more
sense.

 kernel/fork.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..a68445e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -636,7 +636,12 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 			/*
 			 * We don't check the error code - if userspace has
 			 * not set up a proper pointer then tough luck.
+			 *
+			 * We do set_fs() explicitly in case this task
+			 * exited while inside set_fs(KERNEL_DS) for
+			 * some reason (e.g. on a BUG()).
 			 */
+			set_fs(USER_DS);
 			put_user(0, tsk->clear_child_tid);
 			sys_futex(tsk->clear_child_tid, FUTEX_WAKE,
 					1, NULL, NULL, 0);
-- 
1.7.1.31.g6297e

--

From: Andrew Morton
Date: Tuesday, November 30, 2010 - 5:09 pm

On Mon, 29 Nov 2010 21:19:16 -0500

Confused.  The user can only exploit the wrong addr_limit if control
returns to userspace for the user's code to execute.  But that won't be
happening, because this thread will unconditionally exit.


If/when you unconfuse me, I'd suggest this change only be done if the
thread is *known* to have oopsed - doing it for non-oopsed threads
seems unpleasant to my mind.  And I think it should be done nice and
clearly, right up inside do_exit() by some means.  Or perhaps in the
oops code, just before it calls do_exit().  Not hidden down in
mm_release().
--

From: Nelson Elhage
Date: Tuesday, November 30, 2010 - 5:59 pm

The user can exploit the wrong addr_limit on the very next line, with the
put_user() there. clear_child_tid is not checked in any way before this
point. Writing a single zero might not seem like much, but it's enough for
privilege escalation (e.g. overwrite the top half of a function pointer to point
to userspace).

I have a PoC code that uses this bug, along with CVE-2010-3849, to write a zero
to an arbitrary kernel address, so I've tested that this is not theoretical.

That's also why I put the set_fs() hidden inside mm_release, since that's the
only place where (to my knowledge) it matters.

On re-reading, I didn't mention clear_child_tid anywhere in the commit message,
which was an error on my part, and explains the confusion. Sorry about that, and
I hope this clears that up.

Let me know if this makes more sense, and I'll send a revised patch.

--

From: Andrew Morton
Date: Tuesday, November 30, 2010 - 6:49 pm

I do think it would be better to fix up the addr_limit somewhere within
the oops code rather than in the regular code.  Presumably just before
calling do_exit().  Isn't that the logical place?  Plus it fixes up any
other such problems, whether they be there now or in the future.

Although that involves altering every arch, each in multiple places. 
ick.  Maybe at the start of do_exit()?



--

From: Nelson Elhage
Date: Tuesday, November 30, 2010 - 7:27 pm

If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
otherwise reset before do_exit(). do_exit may later (via mm_release in fork.c)
do a put_user to a user-controlled address, potentially allowing a user to
leverage an oops into a controlled write into kernel memory.

A more logical place to put this might be when we know an oops has occurred,
before we call do_exit(), but that would involve changing every architecture, in
multiple places. Let's just stick it in do_exit instead.

Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
---
 kernel/exit.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 21aa7b3..68899b3 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -914,6 +914,14 @@ NORET_TYPE void do_exit(long code)
 	if (unlikely(!tsk->pid))
 		panic("Attempted to kill the idle task!");
 
+	/*
+	 * If do_exit is called because this processes oopsed, it's possible
+	 * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
+	 * continuing. This is relevant at least for clearing clear_child_tid in
+	 * mm_release.
+	 */
+	set_fs(USER_DS);
+
 	tracehook_report_exit(&code);
 
 	validate_creds_for_do_exit(tsk);
-- 
1.7.1.31.g6297e

--

From: KOSAKI Motohiro
Date: Tuesday, November 30, 2010 - 7:50 pm

"This is relevant" is no good explanation ;)
Please recognize this is tricky code and Please consider to write more 
careful and looooong comments.

Thanks.


--

From: Andrew Morton
Date: Wednesday, December 1, 2010 - 5:30 pm

On Wed,  1 Dec 2010 11:50:32 +0900 (JST)

I've seen worse comments.  And occasionally none at all :)

Is this better?

--- a/kernel/exit.c~do_exit-make-sure-we-run-with-get_fs-==-user_ds-fix
+++ a/kernel/exit.c
@@ -917,8 +917,9 @@ NORET_TYPE void do_exit(long code)
 	/*
 	 * If do_exit is called because this processes oopsed, it's possible
 	 * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
-	 * continuing. This is relevant at least for clearing clear_child_tid in
-	 * mm_release.
+	 * continuing. Amongst other possible reasons, this is to prevent
+	 * mm_release()->clear_child_tid() from writing to a user-controlled
+	 * kernel address.
 	 */
 	set_fs(USER_DS);
 
_

--

From: KOSAKI Motohiro
Date: Wednesday, December 1, 2010 - 5:48 pm

Great!



--

From: Andrew Morton
Date: Wednesday, December 1, 2010 - 6:12 pm

On Tue, 30 Nov 2010 21:27:36 -0500

I think that the potential of escalating an oops or a BUG into a local
root hole is pretty serious so I'll send this fix along for 2.6.37 and
I tagged it for -stable backporting, along with a sterner-sounding
changelog.



From: Nelson Elhage <nelhage@ksplice.com>

If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
otherwise reset before do_exit().  do_exit may later (via mm_release in
fork.c) do a put_user to a user-controlled address, potentially allowing a
user to leverage an oops into a controlled write into kernel memory.

This is only triggerable in the presence of another bug, but this
potentially turns a lot of DoS bugs into privilege escalations, so it's
worth fixing.  I have proof-of-concept code which uses this bug along with
CVE-2010-3849 to write a zero to an arbitrary kernel address, so I've
tested that this is not theoretical.


A more logical place to put this fix might be when we know an oops has
occurred, before we call do_exit(), but that would involve changing every
architecture, in multiple places.  Let's just stick it in do_exit instead.

[akpm@linux-foundation.org: update code comment]
Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/exit.c |    9 +++++++++
 1 file changed, 9 insertions(+)

diff -puN kernel/exit.c~do_exit-make-sure-we-run-with-get_fs-==-user_ds kernel/exit.c
--- a/kernel/exit.c~do_exit-make-sure-we-run-with-get_fs-==-user_ds
+++ a/kernel/exit.c
@@ -914,6 +914,15 @@ NORET_TYPE void do_exit(long code)
 	if (unlikely(!tsk->pid))
 		panic("Attempted to kill the idle task!");
 
+	/*
+	 * If do_exit is called because this processes oopsed, it's possible
+	 * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
+	 * continuing. Amongst other possible reasons, this is to prevent
+	 * ...
Previous thread: linux-next: Tree for November 30 by Stephen Rothwell on Monday, November 29, 2010 - 7:07 pm. (1 message)

Next thread: [PATCH -v2 0/3] Report APEI GHES error information via printk by Huang Ying on Monday, November 29, 2010 - 7:51 pm. (17 messages)