Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer

Previous thread: [PATCH] regulator: Update e-mail address for Liam Girdwood by Mark Brown on Friday, August 27, 2010 - 2:33 pm. (2 messages)

Next thread: fs hang in 2.6.27.y | Fwd: [Bug 15658] New: [PATCH] x86 constant_test_bit() prone to misoptimization with gcc-4.4 by Michael Shigorin on Friday, August 27, 2010 - 2:36 pm. (7 messages)
From: Kees Cook
Date: Friday, August 27, 2010 - 3:02 pm

Brad Spengler published a local memory-allocation DoS that
evades the OOM-killer (though not the virtual memory RLIMIT):
http://www.grsecurity.net/~spender/64bit_dos.c

The recent changes to create a stack guard page helps slightly to
discourage this attack, but it is not sufficient. Compiling it statically
moves the libraries out of the way, allowing the stack VMA to fill the
entire TASK_SIZE.

There are two issues:
 1) the OOM killer doesn't notice this argv memory explosion
 2) the argv expansion does not check if rlim[RLIMIT_STACK].rlim_cur is -1.

I figure a quick solution for #2 would be the following patch. However,
running multiple copies of this program could result in similar OOM
behavior, so issue #1 still needs a solution.

Reported-by: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Kees Cook <kees.cook@canonical.com>
---
 fs/exec.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index dab85ec..be40063 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -194,7 +194,8 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 		 *    to work from.
 		 */
 		rlim = current->signal->rlim;
-		if (size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4) {
+		if (size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4 ||
+		    size > TASK_SIZE / 4) {
 			put_page(page);
 			return NULL;
 		}
-- 
1.7.1

-- 
Kees Cook
Ubuntu Security Team
--

From: KOSAKI Motohiro
Date: Sunday, August 29, 2010 - 5:19 pm

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


And, I have a patch for #1. Can you please see this? Alternative idea
is to change rss accounting itself.



From d4e114e5d31b14ebfc399d4b1fb142c7dfce0ca4 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Thu, 19 Aug 2010 20:40:20 +0900
Subject: [PATCH] oom: don't ignore temporary rss while execve

execve() makes new mm struct and setup stack and argv vector,
Unfortunately this new mm is not pointed any tasks, then oom-kill
can't detect this memory usage. therefore oom-kill may kill incorrect
task.

This patch added in-exec rss treatness to oom.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/compat.c             |    8 ++++++--
 fs/exec.c               |   19 +++++++++++++++++--
 include/linux/binfmts.h |    1 +
 include/linux/sched.h   |    1 +
 mm/oom_kill.c           |   36 +++++++++++++++++++++++++++---------
 5 files changed, 52 insertions(+), 13 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 718c706..643140c 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1527,6 +1527,7 @@ int compat_do_execve(char * filename,
 	retval = bprm_mm_init(bprm);
 	if (retval)
 		goto out_file;
+	set_exec_mm(bprm->mm);
 
 	bprm->argc = compat_count(argv, MAX_ARG_STRINGS);
 	if ((retval = bprm->argc) < 0)
@@ -1560,6 +1561,7 @@ int compat_do_execve(char * filename,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	set_exec_mm(NULL);
 	acct_update_integrals(current);
 	free_bprm(bprm);
 	if (displaced)
@@ -1567,8 +1569,10 @@ int compat_do_execve(char * filename,
 	return retval;
 
 out:
-	if (bprm->mm)
-		mmput(bprm->mm);
+	if (current->in_exec_mm) {
+		struct mm_struct *in_exec_mm = set_exec_mm(NULL);
+		mmput (in_exec_mm);
+	}
 
 out_file:
 	if (bprm->file) {
diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..85192e1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1314,6 +1314,17 @@ int ...
From: Roland McGrath
Date: Sunday, August 29, 2010 - 5:56 pm

IMHO unlimited should mean unlimited.  So, on that score, I'd leave this
constraint out and just say whatever deficiencies in the OOM killer (or in
whatever should make a manifestly too-large allocation get ENOMEM) should
just be fixed separately.

But that aside, I'll just consider the intent stated in the comment in
get_arg_page:
		 * Limit to 1/4-th the stack size for the argv+env strings.
		 * This ensures that:
		 *  - the remaining binfmt code will not run out of stack space,
		 *  - the program will have a reasonable amount of stack left
		 *    to work from.
To effect "1/4th the stack size", a cap at TASK_SIZE/4 does make some sense,
since TASK_SIZE is less than RLIM_INFINITY even in the pure 32-bit world,
and that is the true theoretical limit on stack size.

The trouble here, both for that stated intent, and for this "exploit",
is which TASK_SIZE that is on a biarch machine.  In fact, it's the
TASK_SIZE of the process that called execve.  (get_arg_page is called
from copy_strings, from do_execve before search_binary_handler--i.e.,
before anything has looked at the file to decide whether it's going to
be a 32-bit or 64-bit task on exec.)  If it's a 32-bit process exec'ing
a 64-bit program, it's the 32-bit TASK_SIZE (perhaps as little as 3GB).
So that's a limit of 0.75GB on a 64-bit program, which might actually do
just fine with 2 or 3GB.  If it's a 64-bit process exec'ing a 32-bit
program, it's the 64-bit TASK_SIZE (128TB on x86-64).  So that's a limit
of 32TB, which is perhaps not that helpfully less than 2PB minus 1 byte
(RLIM_INFINITY/4) as far as preventing any over-allocation DoS in practice.

So IMHO your change does marginal harm in some cases (32 execs 64)
and makes no appreciable difference to anyone interested in malice
(who can just dodge by exploiting it via 64 execs 64 or 64 execs 32).

If you want to constrain it this way, it's probably simpler just to use
a smaller hard limit for RLIM_STACK at boot time (and hence system-wide).
But it sounds like all ...
From: Solar Designer
Date: Sunday, August 29, 2010 - 8:23 pm

In general, I'd agree, however let's recall that back in 2.2 days we
introduced strnlen_user() and the "max" argument to count() to prevent a
userspace program from making the kernel loop busily for too long.
IIRC, prior to that fix, I was able to cause the kernel to loop for tens
of minutes in a single execve() call on an Alpha with 128 MB RAM, by
using repeated mappings of the same pages (almost 200 GB total).

Now it appears that, besides the issue that started this thread, the
same problem I mentioned above got re-introduced.  We still have
strnlen_user() and the "max" argument to count(), but we no longer have
hard limits for "max".  Someone set MAX_ARG_STRINGS to 0x7FFFFFFF, and
this is just too much.  MAX_ARG_STRLEN is set to 32 pages, and these two
combined allow a userspace program to make the kernel loop for days.

So I think that we should re-introduce some artificial limit(s), maybe
adjustable by root (by the host system's real root only when container
virtualization is involved).  Maybe we should lower MAX_ARG_STRINGS
and/or maybe we should limit the portion of stack space usable for argv




No, we need both.

Additionally, 64bit_dos.c mentions that "it triggers a BUG() as the
stack tries to expand around the address space when shifted".  Perhaps
limiting the stack size would deal with that, but maybe the "bug" needs
to be patched elsewhere as well.  grsecurity has the following hunk:

@@ -505,7 +520,8 @@ static int shift_arg_pages(struct vm_are
 	unsigned long new_end = old_end - shift;
 	struct mmu_gather *tlb;
 
-	BUG_ON(new_start > new_end);
+	if (new_start >= new_end || new_start < mmap_min_addr)
+		return -EFAULT;
 
 	/*
 	 * ensure there are no vmas between where we want to go

which is likely part of a fix (but not the entire fix) for what the
comment in 64bit_dos.c refers to.  However, I was not able to trigger
the BUG_ON() on RHEL5'ish kernels by simply running the 64bit_dos
program (64-bit to 32-bit exec) on two systems, one with 2 GB RAM, ...
From: Roland McGrath
Date: Monday, August 30, 2010 - 3:06 am

And I say, if your userland process could really allocate another 200GB,
then more power to you, you can do it with an exec too.  If you could do
the same with a userland stack allocation, and spend all that time in
strlen calls and then memcpy, you can do it inside execve too.  If it
takes days, that's what you asked for, and it's your process.  It just
ought to be every bit (or near enough) as preemptible and interruptible
as that normal userland activity ought to be.

So, perhaps we want this (count already has a cond_resched in its loop):

diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..0000000 100644  
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -369,6 +369,9 @@ static int count(const char __user * con
 		for (;;) {
 			const char __user * p;
 
+			if (signal_pending(current))
+				return -ERESTARTNOINTR;
+
 			if (get_user(p, argv))
 				return -EFAULT;
 			if (!p)
@@ -400,6 +403,10 @@ static int copy_strings(int argc, const 
 		int len;
 		unsigned long pos;
 
+		if (signal_pending(current))
+			return -ERESTARTNOINTR;
+		cond_resched();
+
 		if (get_user(str, argv+argc) ||
 				!(len = strnlen_user(str, MAX_ARG_STRLEN))) {

I really don't think we need that stuff back.  I think we can get rid of

I don't agree.  If all of the implicit allocation done inside execve is
accounted and controlled as well as normal userland allocations so that
the execve fails when userland allocation would fail, then there is no

That change seems like it might be reasonable, but I haven't really
looked at shift_arg_pages before.  Has someone reported this BUG_ON

Agreed.  But IMHO the missing arbitrary limits on arg/env size are not
among them.  I don't know about shift_arg_pages.  The core fix I think
makes sense is making the nascent mm get accounted to the user process
normally.  Rather than better enabling OOM killing, I think what really
makes sense is for the nascent mm to be marked such that allocations in
it (they'll be from get_arg_page->get_user_pages->handle_mm_fault) ...
From: Solar Designer
Date: Monday, August 30, 2010 - 12:48 pm

This makes sense to me.  However, introducing a new preemption point
may violate assumptions under which the code was written and reviewed
in the past.  In the worst case, we'd introduce/expose race conditions


So, in current kernels, you're making it possible for more kinds of
things to change after prepare_binprm() but before
search_binary_handler().  We'd need to check for possible implications
of this.

I must admit I am not familiar with what additional kinds of things may
change when execution is preempted.  This made a significant difference
in some much older kernels (many years ago), but now that the kernel
makes a lot less use of locking most things may be changed by another
CPU even without preemption.  So does anyone have a list of what
additional risks we're exposed to, if any, when we allow preemption in

64bit_dos.c was supposed to be the reproducer, and I managed to get it
to work (as I've documented in another message earlier today).  The
prerequisites appeared to be (some of these might be specific to my
tests, though):

- 64-bit kernel with 32-bit userland support (e.g., CONFIG_IA32_EMULATION);
- 64-bit build of 64bit_dos.c;
- 32-bit build of the target program;
- no dynamic linking in the target program;
- "ulimit -s unlimited" before running the reproducer program;

I agree.

Thanks,

Alexander
--

From: Roland McGrath
Date: Monday, August 30, 2010 - 5:40 pm

This can't be so.  There are already many possibilities for preemption


What "change"?  Preemption is already possible, that's nothing new.
The only difference is that we might notice TIF_SIGPENDING having been
set, and bail out either before or after prepare_binprm, and so never
call search_binary_handler at all.

The user-visible effect of this could be that a process taking many signals
way too fast could get stuck permanently running its signal handlers and
never actually really attempting the execve (but perhaps doing some mm
setup and arg-copying work and then throwing it away, repeatedly).  Before,
the speed of the signals might have been such that the process could get
into the execve syscall, so it might happen to complete when now it would
be more likely to handle another signal before doing the exec instead.

That, and debuggers might ever see -ERESTARTNOHAND status in syscall
tracing for execve (as they already might for many other syscalls, so they
should be fine).

Note that get_user_pages() already fails for SIGKILL.  So a more
conservative change would be just fatal_signal_pending(current) checks.
That lets only SIGKILL short-circuit the execve to kill the process,
without just letting normal signals interrupt execve like they would
interrupt any other system call.  That should have exactly no
user-visible effects (aside from fairer scheduling in CONFIG_PREEMPT=n
kernels, and better responsiveness to SIGKILL in all kernels).  But I

That's not so much the point, since we already have plenty of places in
this code path that have voluntary preemption points.  We're just adding
more, which improves interactive scheduling performance for the rest of the

I reproduced it too.  The lack of dynamic linking is not really a
requirement to get it to hit, though whether it fails in the kernel can
vary with the stack randomization.  (Sometimes it will instead be the
userland dynamic linker dying with a message that mmap failed.)

I think the right fix for that case ...
From: Roland McGrath
Date: Tuesday, September 7, 2010 - 7:34 pm

This is my take on parts of the execve large arguments copying issues
that Kees posted about, and Brad and others have been discussing.
I've only looked at the narrow area of the argument copying code
itself.  I think these are good and necessary fixes.  But I'm not
addressing the whole OOM killer/mm accounting issue, which also needs
to be fixed (and I have the impression others are already looking into that).

The following changes since commit d56557af19867edb8c0e96f8e26399698a08857f:

  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 (2010-09-07 16:00:17 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland.git topic/exec-fixes

Roland McGrath (3):
      setup_arg_pages: diagnose excessive argument size
      execve: improve interactivity with large arguments
      execve: make responsive to SIGKILL with large arguments

 fs/exec.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)


Thanks,
Roland
--

From: Roland McGrath
Date: Tuesday, September 7, 2010 - 7:35 pm

The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
check the size of the argument/environment area on the stack.
When it is unworkably large, shift_arg_pages() hits its BUG_ON.
This is exploitable with a very large RLIMIT_STACK limit, to
create a crash pretty easily.

Check that the initial stack is not too large to make it possible
to map in any executable.  We're not checking that the actual
executable (or intepreter, for binfmt_elf) will fit.  So those
mappings might clobber part of the initial stack mapping.  But
that is just userland lossage that userland made happen, not a
kernel problem.

Signed-off-by: Roland McGrath <roland@redhat.com>
---
 fs/exec.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..1b63237 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
 #else
 	stack_top = arch_align_stack(stack_top);
 	stack_top = PAGE_ALIGN(stack_top);
+
+	if (unlikely(stack_top < mmap_min_addr) ||
+	    unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
+		return -ENOMEM;
+
 	stack_shift = vma->vm_end - stack_top;
 
 	bprm->p -= stack_shift;
--

From: pageexec
Date: Wednesday, September 8, 2010 - 1:29 am

this could arguably elicit some warning, or even better, prevent the sysctl from
setting mmap_min_addr too high in the first place. or alternatively, just check
for

	mmap_min_addr > stack_top - (vma->vm_end - vma->vm_start)

which i think is more clear as to the intent of the check. and since the next
line already computes stack_shift as vma->vm_end - stack_top, you could do the
whole check afterwards as:

	mmap_min_addr > vma->vm_start - stack_shift

which is even simpler and more to the point (you want what is called new_start in
shift_arg_pages to not violate mmap_min_addr). now you just need the int overflow
check, say: vma->vm_start < stack_shift, so the whole thing would become:

	if (vma->vm_start < stack_shift || mmap_min_addr > vma->vm_start - stack_shift)
		return -EFAULT;

which looks even better if done in shift_arg_pages as i've done it in PaX:

-       BUG_ON(new_start > new_end);
+       if (new_start >= new_end || new_start < mmap_min_addr)
+               return -EFAULT;

 +	    unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))

i think the = case is as valid as any other value close to it ;). also this code

i'd say EFAULT is more consistent, ENOMEM is used for cases when the kernel
couldn't allocate physical memory for something, EFAULT is when there's some
issue with the address space itself (based on the reaction to find_vma or
expand_stack failures).

also what about the BUG_ON in shift_arg_pages? it could go now, right?

personally, i prefer to do this the way i did it in PaX: move up the



--

From: Roland McGrath
Date: Friday, September 10, 2010 - 1:59 am

This code has local checks to ensure that things don't fail worse later.
Those are good to have regardless.  If you'd like to add some constraints
on setting mmap_min_addr, that's certainly fine too.  But it's no reason

IMHO that is far less clear as to what the intent of the check is than what
I wrote.  It's especially subtle that this check allows overflow and then
you check for that later, rather than the formulation I gave where no

My change to setup_arg_pages is consistent with the existing
CONFIG_STACK_GROWSUP code.  IMHO simple fixes should go in first

I disagree.  IMHO, the only proper use of EFAULT is for the passing of a
bad pointer (including passing data that includes a bad pointer, etc.).
So execve really should only use EFAULT when the pointers passed in the
system call are bad.  I used ENOMEM purely for the reason that the
existing CONFIG_STACK_GROWSUP code I paralleled uses it.  But it
certainly does make more sense than EFAULT does.  ENOMEM is used for
cases where you couldn't get the address space you wanted (mmap), which
is a fairly close analogue to this case.

But the one error code that is actually correct for the case is E2BIG.
Obviously that's really the right thing for Argument list too long cases.

The choice of error code here is fairly academic.  Any failure of
setup_arg_pages always leads to sending ourselves a SIGKILL before
returning the error code to percolate back to execve.  So it's only the
error code seen in syscall tracing before the process dies.  It's not
"user-visible" in the usual sense (i.e. POSIX doesn't care what code we

It is now impossible by the logic of the arithmetic, yes.  But it is
another local sanity check asserting the assumptions of the code in that

I'm more comfortable with a fix that doesn't change any other aspect of the
behavior.  Probably moving the call to shift_arg_pages around is fine.  But
IMHO that belongs in a separate set of cleanup patches, and not in the
isolated fix for the BUG_ON ...
From: pageexec
Date: Saturday, September 11, 2010 - 6:30 am

it's somewhat confusing as it does not immediately relate to the problem
you're addressing. let's take a step back and see what issues we have here:

1. the would-be stack vma shifts down so much that it becomes completely
   invalid (due to the int wraparound on its start address).

2. the would be stack vma shifts down to violate mmap_min_addr.

it's obvious that before ensuring 2, 1 must pass. the check for 1 does not
at all involve mmap_min_addr, whereas both of your checks do. even worse, the
check for 2 doesn't depend on the relation between stack_top and mmap_min_addr

why is that? it's obvious that vm_end-vm_size=vm_size and the check simply
ensures that problem 2 above doesn't occur whereas in neither of your checks
can one see the clear intent for ensuring 2. your 2nd check basically ensures
that vm_size fits in the usable address space bounded by [mmap_min_addr, stack_top]
but it doesn't immediately tell the user why that size check is useful (especially
since we're not concerned with sizes per se, but addresses, it's just a corollary
in this particular case that a size check can replace an address check, but that
doesn't mean it'll help any future reader see the original intent). so while
the math is ok in your checks (modulo the = case), it's a kind of premature

no, actually that particular check was for addressing one particular
problem (to show you how to formulate it that makes intent clear), as a
train of thought if you like. the final proposed version (which has been

you're only focused on the math and let the human intent slip by, that's
the issue i have with your checks (and you didn't get the = corner case
right btw, it should be allowed, nor refused if you want to be pedantic).

really, in security clarity of intent and its manifestation in code takes

the GROWSUP code doesn't use unlikely (very confusing to see it in this
context since you're not addressing any hot path/performance issues), nor
does it care about mmap_min_addr (whereas both of your ...
From: Roland McGrath
Date: Tuesday, September 14, 2010 - 12:33 pm

Readability and obviousness are in the eye of the beholder.   
Linus has merged my patches, so this crash is now fixed.  
As I said before, you should by all means send further clean-up
patches if you think they are improvements.

I have no special interest in this area.  I gave my opinions and
advice because I was CC'd, and then submitted some patches myself
because I was asked to.  I can't really keep track of all the
formalities.  The CC list on my submissions represented the set
of people I knew to be involved in the discussion.  I'm sorry if
you felt slighted.  I just pay attention to the code, not the
personalities.  The most reliable way to make sure your name is
associated with a change is to submit patches yourself in a form
that Linus wants to merge.


Thanks,
Roland
--

From: pageexec
Date: Tuesday, September 14, 2010 - 3:35 pm

i was referring to Brad, since he reported this (and the wireless
heap infoleak which he also wasn't credited for). anyway, we won't be
submitting any further vulnerability reports, so no need to worry about

his conditions are unacceptable and therefore it's not going to happen.

--

From: Brad Spengler
Date: Wednesday, September 8, 2010 - 4:57 am

I still don't think this addresses the whole problem.  Without question,
the rlimit / 4 check is bogus.  If nobody agrees with the intent of that 
check, then it should be removed, but I think the better solution is to 
fix the check so that it matches its original intent: let the initial 
stack setup be up to 1/Xth of the min(rlimit, TASK_SIZE dependent upon 
personality), which allows space for additional stack setup in the ELF 
loader and then further growth once the process is live.  If that 
amount is overstepped, then the exec will return an error to the calling 
process instead of being terminated.

It might be useful to consult with the people who introduced/approved 
the check in the first place, as they seemed to have reasons for 
implementing it.

-Brad

From: KOSAKI Motohiro
Date: Wednesday, September 8, 2010 - 10:31 pm

Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
can't change this even though you disliked. That said, we can't break
userland even though userland library is very crazy.

point yet. he seems to want "unlimited" works as "unlimited". then, now
I don't make such patch. Instead, I would propose to insert 
__vm_enough_memory() check in execve() pass. It prevent almost argv attack.



--

From: Roland McGrath
Date: Friday, September 10, 2010 - 2:25 am

I'm sorry you think it's "very crazy" to implement the required
functionality in the only way available.  POSIX requires that execve
fail with E2BIG when the ARG_MAX limit is exceeded.  sysconf has to
return the correct actual limit that execve will enforce so that a
conforming application knows how much it can safely attempt to use.
Since the kernel uses the hard-coded RLIMIT_STACK/4 heuristic and does
not expose the true manifest limit any other way, sysconf has to
parallel the kernel's calculation.


Thanks,
Roland
--

From: KOSAKI Motohiro
Date: Friday, September 10, 2010 - 2:43 am

Hmm...
Probably my poor english leaded to misunderstood. I didn't intent glibc
is very crazy. I only intended to "even if userland is crazy, I disagree
to break userland".

And yes, we obviously need to expose ARG_MAX limit to libc. a duplicated
heuristic code easily makes confusion and mistake. nobody want such 
fragile state. however, it's a bit offtopic. anyway.


Thanks.

--

From: Roland McGrath
Date: Tuesday, September 14, 2010 - 11:51 am

I was referring to the ways available to userland heretofore.  Certainly,
the kernel could add new ways and then userland could do different things
(with new kernels).  

auxv in particular is not a mechanism that could fit for this.  The actual
limit depends on rlimits of the calling process, and rlimits can change
during the life of the program.  auxv is only appropriate for things that
are known at the time of the exec and won't change thereafter.


Thanks,
Roland
--

From: pageexec
Date: Tuesday, September 14, 2010 - 1:28 pm

obviously an AT_ARGMAX computed at execve time would be based on the rlimits
as well and if later userland changed the rlimits, it'd be userland's problem,
not that of the kernel (or the kernel could refuse a change that would violate

you mean stuff like AT_EUID et al.? ;)

--

From: Roland McGrath
Date: Tuesday, September 14, 2010 - 2:16 pm

This would thoroughly defeat the purpose of adding the thing.  The
only reason to have a new thing is so that userland does not have to
mirror the kernel's policy (as it attempts to do now, with the 1/4
calculation).  If the new thing is not something that userland can
use consistently so as not to have to know what the kernel's actual

The information that these give is about the conditions at startup.
That's what they mean to userland, and userland only uses them to know
the situation before it has made any calls.  The definition of AT_EUID
is "effective user ID at program startup", and that fact does not
change.  You proposed AT_ARGMAX for a purpose that requires knowing
the current information in the process at the time it might attempt an
execve call, not at startup.  It is not an equivalent case.


Thanks,
Roland

--

From: pageexec
Date: Tuesday, September 14, 2010 - 3:27 pm

userland could never rely on the kernel's policy at all since get_arg_page
could have failed for more reasons than overstepping the currently hardcoded
ARG_MAX check in there. so what AT_ARGMAX would buy us is to allow the kernel
policy to change over time, but it's never been about guarantees, whether

just for my own curiosity, where does this definition come from?

--

From: pageexec
Date: Wednesday, September 15, 2010 - 2:27 am

yes but it's not only OOM (ENOMEM from some allocation), but it can be also
EPERM from LSM (if mmap_min_addr is set too high) or EFAULT from get_user_pages
(e.g., if VM_FAULT_HWPOISON was returned for a requested page).

--

From: Roland McGrath
Date: Friday, September 10, 2010 - 2:18 am

> I still don't think this addresses the whole problem.  

You're responding to one of three patches, and I said to begin with that
all together they only address part of the problem (not the OOM part).

I question that assertion.  For a non-RLIM_INFINITY limit, there is nothing
in particular wrong with it.  The kernel is free to pick its upper bound
for ARG_MAX by whatever method.  I don't see anything much to object to
about the rlimit/4 method.  It has no useful effect for RLIM_INFINITY and
IMHO should not try to impose any limit in that case.  But that's the only

I see no reason to suspect this was the "original intent".  It seems most
likely to me that the original intent was 1/4th the RLIMIT_STACK size, and

That's what happens now when RLIMIT_STACK is smaller, and that's what
people really care about.  What you suggest would require some more
significant changes to the exec code path, touching all the binfmt modules
(though probably only binfmt_elf matters).  

In the current structure of the code, the arch-dependent SET_PERSONALITY
macro used by {compat_,}binfmt_elf is the only place that knows what arch
bits to set for the new address space size.  This is itself destructive,
but also runs after flush_old_exec (the point of no return).  So you'd have
to reorganize things significantly, or add an entirely new arch macro tied

This was done in commit b6a2fea by Ollie Wild <aaw@google.com>:
	mm: variable length argument support
It was part of going from a fixed maximum to no fixed maximum.
The log includes:
    [a.p.zijlstra@chello.nl: limit stack size]
So perhaps it was Peter who devised the rlimit/4 idea.


Thanks,
Roland
--

From: Roland McGrath
Date: Tuesday, September 7, 2010 - 7:36 pm

This adds a preemption point during the copying of the argument and
environment strings for execve, in copy_strings().  There is already
a preemption point in the count() loop, so this doesn't add any new
points in the abstract sense.

When the total argument+environment strings are very large, the time
spent copying them can be much more than a normal user time slice.
So this change improves the interactivity of the rest of the system
when one process is doing an execve with very large arguments.

Signed-off-by: Roland McGrath <roland@redhat.com>
---
 fs/exec.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 1b63237..6f2d777 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -419,6 +419,8 @@ static int copy_strings(int argc, const char __user *const __user *argv,
 		while (len > 0) {
 			int offset, bytes_to_copy;
 
+			cond_resched();
+
 			offset = pos % PAGE_SIZE;
 			if (offset == 0)
 				offset = PAGE_SIZE;
--

From: Roland McGrath
Date: Tuesday, September 7, 2010 - 7:37 pm

An execve with a very large total of argument/environment strings
can take a really long time in the execve system call.  It runs
uninterruptibly to count and copy all the strings.  This change
makes it abort the exec quickly if sent a SIGKILL.

Note that this is the conservative change, to interrupt only for
SIGKILL, by using fatal_signal_pending().  It would be perfectly
correct semantics to let any signal interrupt the string-copying in
execve, i.e. use signal_pending() instead of fatal_signal_pending().
We'll save that change for later, since it could have user-visible
consequences, such as having a timer set too quickly make it so that
an execve can never complete, though it always happened to work before.

Signed-off-by: Roland McGrath <roland@redhat.com>
---
 fs/exec.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 6f2d777..828dd24 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -376,6 +376,9 @@ static int count(const char __user * const __user * argv, int max)
 			argv++;
 			if (i++ >= max)
 				return -E2BIG;
+
+			if (fatal_signal_pending(current))
+				return -ERESTARTNOHAND;
 			cond_resched();
 		}
 	}
@@ -419,6 +422,10 @@ static int copy_strings(int argc, const char __user *const __user *argv,
 		while (len > 0) {
 			int offset, bytes_to_copy;
 
+			if (fatal_signal_pending(current)) {
+				ret = -ERESTARTNOHAND;
+				goto out;
+			}
 			cond_resched();
 
 			offset = pos % PAGE_SIZE;
--

From: KOSAKI Motohiro
Date: Tuesday, September 7, 2010 - 8:00 pm

All of changes looks nice to me :)
Thanks.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




--

From: KOSAKI Motohiro
Date: Wednesday, September 8, 2010 - 10:01 pm

Now, we have two OOM-Killer/mm acounting problem.
 1) OOM-killer doesn't track nascent mm and It may kill innocent task
 2) When execve argument-copying, our __vm_enough_memory() doesn't
    protect any wrong plenty argument. then, execve() invoke OOM instead
    return failure value when larger argument than system memory.

The patch series addressed this two issue.



--

From: KOSAKI Motohiro
Date: Wednesday, September 8, 2010 - 10:03 pm

This patch was made on top "oom: remove totalpage normalization from oom_badness()" patch.

===============================
Execve() makes new mm struct and setup stack and push argv vector,
Unfortunately this nascent mm is not pointed any tasks, then
OOM-killer can't detect this memory usage. therefore OOM-killer
may kill incorrect task.

Thus, this patch added task->in_exec_mm member and track
nascent mm usage.

Cc: pageexec@freemail.hu
Cc: Roland McGrath <roland@redhat.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Eugene Teo <eteo@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/compat.c             |    4 +++-
 fs/exec.c               |   14 +++++++++++++-
 include/linux/binfmts.h |    1 +
 include/linux/sched.h   |    1 +
 mm/oom_kill.c           |   37 +++++++++++++++++++++++++++++--------
 5 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 718c706..b631120 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1567,8 +1567,10 @@ int compat_do_execve(char * filename,
 	return retval;
 
 out:
-	if (bprm->mm)
+	if (bprm->mm) {
+		set_exec_mm(NULL);
 		mmput(bprm->mm);
+	}
 
 out_file:
 	if (bprm->file) {
diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..b41834c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -347,6 +347,8 @@ int bprm_mm_init(struct linux_binprm *bprm)
 	if (err)
 		goto err;
 
+	set_exec_mm(bprm->mm);
+
 	return 0;
 
 err:
@@ -983,6 +985,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 		goto out;
 
 	bprm->mm = NULL;		/* We're using it now */
+	set_exec_mm(NULL);
 
 	current->flags &= ~PF_RANDOMIZE;
 	flush_thread();
@@ -1314,6 +1317,13 @@ int search_binary_handler(struct linux_binprm *bprm,struct pt_regs *regs)
 
 EXPORT_SYMBOL(search_binary_handler);
 
+void set_exec_mm(struct mm_struct *mm)
+{
+	task_lock(current);
+	current->in_exec_mm = mm;
+	task_unlock(current);
+}
+
 /*
  * sys_execve() ...
From: Oleg Nesterov
Date: Thursday, September 9, 2010 - 3:05 pm

Oh, yes, this has to be per-thread (unlike ->mm, btw).

I wonder if it makes sense to move ->cred_guard_mutex from task_struct
to signal_struct and thus make multiple-threads-inside-exec impossible.
Only one thread can win anyway.

Oleg.

--

From: Roland McGrath
Date: Friday, September 10, 2010 - 2:39 am

That probably makes sense.  Note that cred_guard_mutex is also overloaded
for ptrace_attach, so this would add some more serialization of attaches to
threads in the same group.  But as long as actual attachment serializes on
tasklist_lock anyway, it doesn't make a material difference.  (Even without
that, it would presumably be the same debugger attaching serially to
threads in the same group, so it wouldn't degrade anything in practice.)


Thanks,
Roland
--

From: KOSAKI Motohiro
Date: Friday, September 10, 2010 - 2:57 am

Interesting idea, really.
I've thought

1) moving cread_guard_mutex itself
   - no increase execve overhead
	-> very good
   - it also prevent parallel ptrace
	-> probably no problem. nobody want parallel debugging
2) move in_exec_mm to signal_struct too
   -> very hard. oom-killer can use very few lock because it's called
      from various place. now both ->mm and ->in_exec_mm are protected
      task_lock() and it help to avoid messy.


So, I've done only (1). I think this restriction effectivly prevent 
some theorical execve() resouce smashing attack.


======================================================================
Subject: [PATCH] move cred_guard_mutex from task_struct to signal_struct

Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent execve()
has no worth.

Let's move ->cred_guard_mutex from task_struct to signal_struct. It
naturally prevent multiple-threads-inside-exec.

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/exec.c                 |    8 ++++----
 fs/proc/base.c            |    8 ++++----
 include/linux/init_task.h |    4 ++--
 include/linux/sched.h     |    7 ++++---
 kernel/cred.c             |    2 --
 kernel/fork.c             |    2 ++
 kernel/ptrace.c           |    4 ++--
 7 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 81d0d06..052ed54 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1091,14 +1091,14 @@ EXPORT_SYMBOL(setup_new_exec);
  */
 int prepare_bprm_creds(struct linux_binprm *bprm)
 {
-	if (mutex_lock_interruptible(&current->cred_guard_mutex))
+	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
 	bprm->cred = prepare_exec_creds();
 	if (likely(bprm->cred))
 		return 0;
 ...
From: Oleg Nesterov
Date: Friday, September 10, 2010 - 10:24 am

No, it doesn't. Only PTRACE_ATTACH needs this mutex, and as Roland
pointed out it also needs write_lock(tasklist) which is worse. So

Yes. But, if ->in_exec_mm is only used by oom_badness(), then I think
you can use task_lock(tsk->group_leader). oom_badness() needs tasklist
anyway, this means it can't race with de_thread() changing the leader.
But up to you.

Another very minor nit (but again, up to you). Perhaps exec_mmap()
could clear ->in_exec_mm (in task_struct or signal_struct, this doesnt
matter), it takes task_lock(current) anyway (and at this point current

Reviewed-by: Oleg Nesterov <oleg@redhat.com>


This is very minor, but perhaps you can also fix a couple of comments
which mention task->cred_guard_mutex,

	fs/exec.c:1109		the caller must hold current->cred_guard_mutex
	kernel/cred.c:328	The caller must hold current->cred_guard_mutex
	include/linux/tracehook.h:153	@task->cred_guard_mutex

Oleg.

--

From: KOSAKI Motohiro
Date: Wednesday, September 15, 2010 - 10:51 pm

Will fix, of cource.




--

From: Linus Torvalds
Date: Friday, September 10, 2010 - 8:06 am

On Wed, Sep 8, 2010 at 10:04 PM, KOSAKI Motohiro

This is horrible. We don't want to walk the arguments one more time
just for this. Let's just improve the checks that we do as we go
along.

                            Linus
--

From: KOSAKI Motohiro
Date: Monday, September 13, 2010 - 6:52 pm

Okey. I'll consider new way in this night.



--

From: KOSAKI Motohiro
Date: Wednesday, September 15, 2010 - 10:51 pm

After while thinking, I decided to just drop this idea. because
 1) If one pass check is must, we can't reuse vm-overcommit check.
 2) Glibc has the duplicated hueristic, then we can't change it nor
    introduce new hard limit. (Sh*t)
 3) This is not must fix, it only mitigate a pain when accidental large
    argv case. Only OOM fixes enough care intended attack case.
 4) distro can change default of rlim_max of RLIMIT_STACK. It protect
    from RLIM_INFINITY smash.

Briefly says, to introduce new limit has bad benefit/risk balance. Sadly.



--

From: Linus Torvalds
Date: Thursday, September 16, 2010 - 8:01 am

Well, I mostly agree. That said, I do think we could extend the
limiter some ways.

For example, I think the "stack limit / 4" is perfectly sane, but it
would make total sense to perhaps also take into account the AS and
RSS limits.

And I do think that your attempt to use __vm_enough_memory() was good.
It happens to be coded in a way that makes it useless for a one-pass
model, and some of what it does would be too expensive to do up-front
when you can't short-circuit it, but I do think that it would probably
be appropriate to at least try to take the _rough_ code there and use
it as a limit for maximum stack size too.

For example, we could have a function somewhat like

    unsigned long max_stack_size(void)
   {
        unsigned long allowed, used, limit;

        switch (sysctl_overcommit_memory) {
        case OVERCOMMIT_ALWAYS:
                allowed = ULONG_MAX;
                break;
        case OVERCOMMIT_GUESS:
                .. maybe we can come up with some upper bound here too ..
                break;
        default:
                allowed = (totalram_pages - hugetlb_total_pages())
                        * sysctl_overcommit_ratio / 100;
                if (!cap_sys_admin)
                        allowed -= allowed / 32;
                allowed += total_swap_pages;
                /* Don't let a single process grow too big:
                   leave 3% of the size of this process for other processes */
                if (mm)
                        allowed -= mm->total_vm / 32;
                /* What is already committed to? */
                used = percpu_counter_read_positive(&vm_committed_as);
                if (used > allowed)
                        return 0;
                allowed -= used;
                break;
        }
        limit = ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4;
        if (allowed > limit)
                allowed = limit;
        return allowed;
    }

which we'd call once at the beginning of the execve(), ...
From: Solar Designer
Date: Monday, August 30, 2010 - 10:49 am

I am finally able to trigger the BUG by replacing "/bin/sh" with
"/bin/false" in 64bit_dos.c, relying on our library-free implementation
of /bin/false on Owl:

owl!solar:~$ objdump -d /bin/false

/bin/false:     file format elf32-i386

Disassembly of section .text:

08048074 <.text>:
 8048074:       b8 01 00 00 00          mov    $0x1,%eax
 8048079:       bb 01 00 00 00          mov    $0x1,%ebx
 804807e:       cd 80                   int    $0x80

owl!solar:~$ file 64bit_dos
64bit_dos: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.4.0, statically linked, for GNU/Linux 2.4.0, stripped

(the "exploit" is statically-linked since I brought it from another
machine, I don't think this matters).

After looping in the kernel for about 10 seconds, I got:

Kernel BUG at fs/exec.c:535
[...]
Call Trace:
 [<ffffffff8007c44c>] load_elf32_binary+0x7f9/0x1702
 [<ffffffff800c193f>] expand_stack+0x7f/0xad
 [<ffffffff8003e39b>] search_binary_handler+0x94/0x1e2
 [<ffffffff8003da2f>] do_execve+0x18e/0x1f2
 [<ffffffff800542cc>] sys_execve+0x34/0x51
 [<ffffffff8005e523>] stub_execve+0x67/0xb0
[...]
Code: 0f 0b 68 fe 93 3e 80 c2 17 02 48 8b 7c 24 08 4c 89 fe e8 da 
RIP  [<ffffffff8002dc90>] setup_arg_pages+0x151/0x2d3

The kernel is a revision and custom build of 2.6.18-128.2.1.el5
(whatever I happened to readily have installed on a test system).
I think the problem should be reproducible with current RHEL5 kernels
and likely with latest mainstream kernels as well.

The process is stuck:

solar    28754  2.8 77.8 3142276 3142276 pts/0 D+   10:34   0:13 [false]

(uninterruptible)

3 GB memory is still taken.

Alexander
--

From: Brad Spengler
Date: Monday, August 30, 2010 - 3:08 pm

Hi guys,

I see you're having fun with my code ;)  Just wanted to remind you that 
I do exist; I reported this in December 2009 to Ted Tso and again 
recently (forwarded the same email from 2009) to Kees Cook and James 
Morris.  So even though nobody's actually emailed me about the issue(s), 
I am available to answer any questions.  Just CC me on the email as I'm 
not subscribed to the list.

Anyway, I did actually research the bug(s) involved quite a bit around 
the time I reported it, so hopefully some of the below will help.

The bug seems to have been introduced in 2.6.23, see:
http://thread.gmane.org/gmane.linux.ports.hppa/752
http://www.spinics.net/lists/linux-arch/msg01584.html
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg170491.html
though I'm guessing the functionality was also backported to major 
distros

The check using the current stack limit as a byte value (including when 
it's RLIMIT_INFINITY) by dividing it by 4 is completely broken for a 
number of reasons:
1/4th of a 64bit ~0 is several times larger than the size of addressable 
userland address space on most 64bit architectures (amd64 is 47 bits)
1/4th of a 64bit ~0 is way bigger than any 32bit value
No other place in the kernel treats a limit in this way
With a high rlimit and high usage, the code doesn't do what it intends 
to do -- be able to return a meaningful error message to execve, instead 
of having the process doing the execve terminated with a SIGKILL in 
later stages of ELF loading.

Combined with the fact that the max arg size * max number of args 
(~256TB) is also again larger than the 32bit address space and the 
amd64 userland address space (as well as the address space of several other 
64bit architectures), lots of problems appear.  This was exacerbated by 
the behavior that, until recently fixed, allowed the stack to grow over 
any other existing mappings.  Combine this with ASLR and shifting that 
whole stack range down a random amount via shift_arg_pages/adjust_vma ...
From: Solar Designer
Date: Tuesday, August 31, 2010 - 4:53 am

Brad, Roland -

Thank you for your comments and your work on this.


Agreed.

Alexander
--

From: Tetsuo Handa
Date: Tuesday, August 31, 2010 - 4:56 am

As far as I know, RHEL >= 5.3 and Asianux >= 3.2 backported this functionality.
--

Previous thread: [PATCH] regulator: Update e-mail address for Liam Girdwood by Mark Brown on Friday, August 27, 2010 - 2:33 pm. (2 messages)

Next thread: fs hang in 2.6.27.y | Fwd: [Bug 15658] New: [PATCH] x86 constant_test_bit() prone to misoptimization with gcc-4.4 by Michael Shigorin on Friday, August 27, 2010 - 2:36 pm. (7 messages)