login
Header Space

 
 

Re: [PATCH REPOST^3] Run IST traps from user mode preemptive on process stack

Previous thread: [PATCH] i386: Execute stack overflow warning on interrupt stack by Andi Kleen on Friday, May 2, 2008 - 5:18 am. (13 messages)

Next thread: motive for sending Ethernet packets in soft-irq context by Parav Pandit on Friday, May 2, 2008 - 5:16 am. (2 messages)
To: <tglx@...>, <mingo@...>, <linux-kernel@...>, <jkosina@...>, <zdenek.kabelac@...>
Date: Friday, May 2, 2008 - 5:19 am

[Fixes a bug originally introduced in 2.6.23 I think. Reposted many times
so far, but still not applied. Please apply.]

Run IST traps from user mode preemptive on process stack

x86-64 has a few exceptions which run on special architecture 
supported IST exception stacks: these are nmi, double fault, stack fault,
int 3, debug.

Previously they would check for a scheduling event on returning
if the original CPU state was user mode and then switch to a 
process stack to schedule.

But the actual trap handler would still run on the IST stack
with preemption disabled.

This patch changes these traps instead to always switch 
to the process stack when the trap originated from user mode.
For kernel traps it keeps running non preemptive on the IST stack
because that is much safer (e.g. to still get nmi watchdog events
out even when the process stack is corrupted) 

Then the actual trap handlers can run with preemption enabled
or schedule as needed (e.g. to take locks) 

This fixes a regression I added earlier with print_vma_addr()
executing down() in these trap handlers from user space.

Strictly the change would have been only needed for debug 
and int3, but since they share this code as macros it was 
cleanest to just change all.

Cc: jkosina@suse.cz
Cc: zdenek.kabelac@gmail.com

Signed-off-by: Andi Kleen &lt;ak@suse.de&gt;
Signed-off-by: Andi Kleen &lt;andi@firstfloor.org&gt;

---
 arch/x86/kernel/entry_64.S |   17 ++++++++++-------
 arch/x86/kernel/traps_64.c |   17 ++++++++++-------
 2 files changed, 20 insertions(+), 14 deletions(-)

Index: linux/arch/x86/kernel/entry_64.S
===================================================================
--- linux.orig/arch/x86/kernel/entry_64.S
+++ linux/arch/x86/kernel/entry_64.S
@@ -771,12 +771,18 @@ END(spurious_interrupt)
 	.if \ist
 	movq	%gs:pda_data_offset, %rbp
 	.endif
-	movq %rsp,%rdi
 	movq ORIG_RAX(%rsp),%rsi
 	movq $-1,ORIG_RAX(%rsp)
 	.if \ist
 	subq	$EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) ...
To: Andi Kleen <andi@...>
Cc: <mingo@...>, <linux-kernel@...>, <jkosina@...>, <zdenek.kabelac@...>
Date: Tuesday, May 6, 2008 - 8:31 am

The print_vma_addr() bug was introduced by commit 03252919 (by you) in
the 2.6.25 merge window and we fixed it before 25-rc2 via commit e8bff74a.

Do you know about any other breakage in that code? If not, why do you


Aside of the fact, that it has been fixed long time ago in a simple
and non dangerous way already, your "fix" adds a far more serious

This introduces two security bugs in one go:

1.) The IST stack pointer is elevated unconditionaly and with your
change we can now schedule away in the handler. Then another task on
this same CPU triggers the same trap and elevates it again. This can
nest multiple times and there is no protection against an IST stack
overflow at all!

This is probably a nice fat user-triggerable root hole in fact!
Exploitable because user-space has control over the pt_regs data,
so by nesting enough such overflows a critical kernel data structure
can probably be written with near-arbitrary content.

At minimum you've introuduced a nasty DoS hole that would almost never
trigger during normal loads - it would probably result in extremely hard
to debug memory corruption in data structures that are near the IST stacks.

2.) The IST stack pointer is unbalanced in the other direction as well:
it is incremented on CPUn then the handler is scheduled away and migrated
to CPUm. After return from the handler the IST stacks of both CPUs are
corrupted. Exploitable by unprivileged user-space code as well.

Thanks,

	tglx
--
To: Thomas Gleixner <tglx@...>
Cc: <mingo@...>, <linux-kernel@...>, <jkosina@...>, <zdenek.kabelac@...>
Date: Tuesday, May 6, 2008 - 9:03 am

Well it was worked around, not properly fixed. This patch fixes it properly.
The problem of the original workaround is that it wouldn't print the vma
now in many cases because it couldn't take the semaphore.

The workaround was right back then because it was shortly before 
the release, but it was always a ward that needed fixing properly.

I believe it was a good idea anyways because there were always 
some possible problems with not being able to sleep in these 


That's not possible generally. None of these interrupts can
nest in a normal kernel.

Do you refer to the DEBUG_STACK ist add/dec? That is not needed

The IST is restored again after the handler. You're right that strictly
wouldn't be needed and could be avoided, but i don't think it's exploitable
in any ways.

Regarding user controlled pt_regs: I think you're forgetting that
x86-64 doesn't have vm86 mode and that we can always trust pt_regs
segment indexes. On i386 you would be right, but not here.

-Andi
--
To: Andi Kleen <andi@...>
Cc: Thomas Gleixner <tglx@...>, <linux-kernel@...>, <jkosina@...>, <zdenek.kabelac@...>
Date: Tuesday, May 6, 2008 - 10:41 am

First you are putting pressure on us by innuendo that we somehow did not 
react to an unfixed x86 bug, supposedly starting in v2.6.23 and still 
not fixed in v2.6.26 as we speak. This is a pretty serious accusation!

But, if one looks closer, your accusation does not hold up to scrutiny: 
it turns out that it is not in v2.6.23 but v2.6.25, that it got fixed 
within 2 weeks between v2.6.25-rc1 and v2.6.25-rc2, that it is not an 
unfixed bug that we ignored but at most about a small detail in a debug 
feature introduced by _you_.

Why are you doing this with us? First you are accusing us and then you 
are ignoring straight questions that would clear us from your unfair 
accusation. Why?

	Ingo
--
To: Andi Kleen <andi@...>
Cc: Thomas Gleixner <tglx@...>, <linux-kernel@...>, <jkosina@...>, <zdenek.kabelac@...>
Date: Tuesday, May 6, 2008 - 10:39 am

* Andi Kleen &lt;andi@firstfloor.org&gt; wrote:


huh? While this issue is dwarfed by the security hole your patch 
introduces, you miss the whole point about debug printouts in case of 
traps.

In practice we dont need to print out _anything_ from int3 traps (even 
if they were unexpected) - user-space very much knows it has set a 
breakpoint.

What we are interested in are the segmentation faults for example. Those 
do get printed out correctly as segmentation faults do not go via IST 
traps, they go via the normal process stack.

Furthermore, we _do_ print out the fault location even for int3 if we 
are not preemptible. An example i just triggered on latest -git:

  int3[2789] trap int3 ip:4004cd sp:7fff27501c50 error:0

And we do print out the vma information too in other, much more 
interesting trap types such as unresolved page faults:

  segfault[2652]: segfault at 0 ip 400471 sp 7fff05d42480 error 6 in segfault[400000+1000]

So what we do worst-case is that we do not do a find_vma() and we dont 
print out the vma. Not a big deal at all for an int3 or a hw-breakpoint 
trap ...

	Ingo
--
To: Andi Kleen <andi@...>
Cc: Thomas Gleixner <tglx@...>, <linux-kernel@...>, <jkosina@...>, <zdenek.kabelac@...>, Arjan van de Ven <arjan@...>
Date: Tuesday, May 6, 2008 - 10:34 am

no, your patch is not fine at all. It introduces a security hole, as 
Thomas pointed it out to you.

The IST pointer in the _TSS_ can go out of bounds by your patch! That 
corruption can be controlled by malicious userspace. Read the analysis 
from Thomas.

That percpu_tss.ist[] location is not just a random pointer. It is the 
TSS that is read by the CPU on every trap. It is a security-critical 
structure and every modification to it has to be treated very, very 
carefully. If it goes out of bounds that leads to very nasty problems.

This is a pretty bad security bug, and your reply shows ignorance about 
the code your patch is modifying - _after_ Thomas has pointed it out to 

read the analysis from Thomas. Here is a sample buggy sequence:

  task #1 runs, does int3, goes on DEBUG_STACK IST stack,
      we do -= 4096 of the tss.ist[DEBUG_STACK-1] pointer, handler schedules
  task #2 runs, does int3, goes on DEBUG_STACK IST stack, handler schedules
      we do -= 4096 of the tss.ist[DEBUG_STACK-1] pointer, handler schedules
  task #3 runs, does int3, goes on DEBUG_STACK IST

  =&gt; poof, stack underflow, memory corruption of nearby data structures,
     because the DEBUG_STACK is only 8KB...


Wrong. The pointer the subq/addq instructions modify are in fact used by 
_EVERY SINGLE_ IST trap sequence that the 64-bit kernel executes (!).

This is the modification that the paranoidentry macro does to the TSS in 
entry_64.S:

  subq $EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) * 8(%rbp)
  ...
  addq $EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) * 8(%rbp)

that modifies the TSS _directly_. The TSS is directly read by the CPU on 
every trap entry. The percpu_tss.ist[] array of pointers is then used by 
the CPU for every single IST trap. (hw-debug, NMI, int3, double-fault, 
stacksegment-overflow, MCE)

This is nothing new or unexpected, it is basic x86-64 IST architectural 
behavior implemented in the CPU.

The bug you introduce is that if the ...
Previous thread: [PATCH] i386: Execute stack overflow warning on interrupt stack by Andi Kleen on Friday, May 2, 2008 - 5:18 am. (13 messages)

Next thread: motive for sending Ethernet packets in soft-irq context by Parav Pandit on Friday, May 2, 2008 - 5:16 am. (2 messages)
speck-geostationary