Re: [PATCH 2/2] exec: copy-and-paste the fixes into compat_do_execve() paths

Previous thread: [resend][PATCH] mm: increase RECLAIM_DISTANCE to 30 by KOSAKI Motohiro on Sunday, October 24, 2010 - 8:24 pm. (2 messages)

Next thread: linux-next: Tree for October 25 by Stephen Rothwell on Sunday, October 24, 2010 - 8:58 pm. (14 messages)
From: KOSAKI Motohiro
Date: Sunday, October 24, 2010 - 8:26 pm

Current oom_score_adj is completely broken because It is strongly bound
google usecase and ignore other all.

1) Priority inversion
   As kamezawa-san pointed out, This break cgroup and lxr environment.
   He said,
	> Assume 2 proceses A, B which has oom_score_adj of 300 and 0
	> And A uses 200M, B uses 1G of memory under 4G system
	>
	> Under the system.
	> 	A's socre = (200M *1000)/4G + 300 = 350
	> 	B's score = (1G * 1000)/4G = 250.
	>
	> In the cpuset, it has 2G of memory.
	> 	A's score = (200M * 1000)/2G + 300 = 400
	> 	B's socre = (1G * 1000)/2G = 500
	>
	> This priority-inversion don't happen in current system.

2) Ratio base point don't works large machine
   oom_score_adj normalize oom-score to 0-1000 range.
   but if the machine has 1TB memory, 1 point (i.e. 0.1%) mean
   1GB. this is no suitable for tuning parameter.
   As I said, proposional value oriented tuning parameter has
   scalability risk.

3) No reason to implement ABI breakage.
   old tuning parameter mean)
	oom-score = oom-base-score x 2^oom_adj
   new tuning parameter mean)
	oom-score = oom-base-score + oom_score_adj / (totalram + totalswap)
   but "oom_score_adj / (totalram + totalswap)" can be calculated in
   userland too. beucase both totalram and totalswap has been exporsed by
   /proc. So no reason to introduce funny new equation.

4) totalram based normalization assume flat memory model.
   example, the machine is assymmetric numa. fat node memory and thin
   node memory might have another wight value.
   In other word, totalram based priority is a one of policy. Fixed and
   workload depended policy shouldn't be embedded in kernel. probably.

Then, this patch remove *UGLY* total_pages suck completely. Googler
can calculate it at userland!

Cc: stable@kernel.org
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/base.c        |   35 +++----------
 include/linux/oom.h   |   16 +-----
 include/linux/sched.h |    2 +-
 mm/oom_kill.c         |  142 ...
From: KOSAKI Motohiro
Date: Sunday, October 24, 2010 - 8:27 pm

oom_adj is not only used for kernel knob, but also used for
application interface. Then, adding new knob is no good
reason to deprecate it. Don't do stupid!

Also, after former patch, oom_score_adj can't be used for setting
OOM_DISABLE. We need "echo -17 > /proc/<pid>/oom_adj" thing.

This reverts commit 51b1bd2ace1595b72956224deda349efa880b693.

Cc: stable@kernel.org
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 Documentation/feature-removal-schedule.txt |   25 -------------------------
 Documentation/filesystems/proc.txt         |    3 ---
 fs/proc/base.c                             |    8 --------
 include/linux/oom.h                        |    3 ---
 4 files changed, 0 insertions(+), 39 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 9961f15..1cba5b8 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -151,31 +151,6 @@ Who:	Eric Biederman <ebiederm@xmission.com>
 
 ---------------------------
 
-What:	/proc/<pid>/oom_adj
-When:	August 2012
-Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
-	badness heuristic used to determine which task to kill when the kernel
-	is out of memory.
-
-	The badness heuristic has since been rewritten since the introduction of
-	this tunable such that its meaning is deprecated.  The value was
-	implemented as a bitshift on a score generated by the badness()
-	function that did not have any precise units of measure.  With the
-	rewrite, the score is given as a proportion of available memory to the
-	task allocating pages, so using a bitshift which grows the score
-	exponentially is, thus, impossible to tune with fine granularity.
-
-	A much more powerful interface, /proc/<pid>/oom_score_adj, was
-	introduced with the oom killer rewrite that allows users to increase or
-	decrease the badness() score linearly.  This interface will ...
From: David Rientjes
Date: Monday, October 25, 2010 - 1:40 pm

NACK as a logical follow-up to my NACK for "oom: remove totalpage 
normalization from oom_badness()"
--

From: KOSAKI Motohiro
Date: Tuesday, October 26, 2010 - 6:01 am

Huh?

I requested you show us justification. BUT YOU DIDNT. If you have any 
usecase, show us RIGHT NOW. 

Don't speaking sucking crap. If you don't want ignore you. you are busy
than you.

DONT STUPID.


--

From: David Rientjes
Date: Tuesday, October 26, 2010 - 12:37 pm

The new tunable added in 2.6.36, /proc/pid/oom_score_adj, is necessary for 
the units that the badness score now uses.  We need a tunable with a much 
higher resolution than the oom_adj scale from -16 to +15, and one that 
scales linearly as opposed to exponentially.  Since that tunable is much 
more powerful than the oom_adj implementation, which never made any real 
sense for defining oom killing priority for any purpose other than 
polarization, the old tunable is deprecated for two years.
--

From: KOSAKI Motohiro
Date: Monday, November 1, 2010 - 12:06 am

The reason that you ware NAKed was not to introduce new powerful feature.

You haven't tested your patch at all. Distro's initram script are using
oom_adj interface and latest kernel show pointless warnings 
"/proc/xx/oom_adj is deprecated, please use /proc/xx/oom_score_adj instead."
at _every_ boot time.

As I said, DON'T SEND UNTESTED PATCH! DON'T BREAK USERLAND CARELESSLY!


--

From: David Rientjes
Date: Monday, November 1, 2010 - 12:36 pm

Linux users who care about prioritizing tasks for oom kill with a tunable 
that (1) has a unit, (2) has a higher resolution, and (3) is linear and 
not exponential.  Memcg doesn't solve this issue without incurring a 1% 

No, it doesn't, and you completely and utterly failed to show a single 
usecase that broke as a result of this because nobody can currently use 
oom_adj for anything other than polarization.  Thus, there's no backwards 

Yes, I've tested it, and it deprecates the tunable as expected.  A single 
warning message serves the purpose well: let users know one time without 
being overly verbose that the tunable is deprecated and give them 
sufficient time (2 years) to start using the new tunable.  That's how 
deprecation is done.
--

From: KOSAKI Motohiro
Date: Monday, November 8, 2010 - 7:26 pm

No. Majority user don't care. You only talk about your case. Don't ignore

Look at a real.
All major distributions has already turn on memcg. End user don't need

No. I showed. 
1) Google code search showed some application are using this feature.
	http://www.google.com/codesearch?as_q=oom_adj&btnG=Search+Code&hl=ja&as_pa...

2) Not body use oom_adj other than polarization even though there are a few.
   example, kde are using.
	http://www.google.com/codesearch/p?hl=ja#MPJuLvSvNYM/pub/kde/unstable/snapshots/kdelib...

When you are talking polarization issue, you blind a real. Don't talk your dream.

3) udev are using this feature. It's one of major linux component and you broke.

http://www.google.com/codesearch/p?hl=ja#KVTjzuVpblQ/pub/linux/utils/kernel/hotplug/ud...

You don't have to break our userland. you can't rewrite or deprecate 

no sense.

Why do their application need to rewrite for *YOU*? Okey, you will got
benefit from your new knob. But NOBDOY use the new one. and People need
to rewrite their application even though no benefit. 

Don't do selfish userland breakage!



--

From: KOSAKI Motohiro
Date: Monday, November 8, 2010 - 8:28 pm

And you said you ignore bug even though you have seen it. It suck!



--

From: KOSAKI Motohiro
Date: Sunday, November 14, 2010 - 5:24 pm

At v2.6.36-rc1, oom-killer doesn't work at all because YOU BROKE.
And I was working on fixing it.

2010-08-19
http://marc.info/?t=128223176900001&r=1&w=2
http://marc.info/?t=128221532700003&r=1&w=2
http://marc.info/?t=128221532500008&r=1&w=2

However, You submitted new crap before the fixing. 

2010-08-15
http://marc.info/?t=128184669600001&r=1&w=2

If you tested mainline a bit, you could find the problem quickly.
You should have fixed mainline kernel at first.


	Again, YOU HAVEN'T TESTED YOUR OWN PATCH AT ALL.





--

From: David Rientjes
Date: Monday, November 15, 2010 - 2:59 am

This existed before my oom killer rewrite, it was only noticed because the 

Yes, tasklist_lock was dropped in a mismerge of my patches when posting 

Yes, if a task was racing between oom_kill_process() and oom_kill_task() 
and all threads had dropped its mm between calls then there was a NULL 

This isn't "crap", this is a necessary bit to ensure that tasks that share 
an ->mm with a task immune from kill aren't killed themselves since we 
can't free the memory.  We came to the consensus that it would be better 
to count the tasks that are OOM_DISABLE in the mm_struct to avoid the 

Thanks for finding a couple fixes during the 2.6.36-rc1 when the rewrite 
was first merged, it's much appreciated!
--

From: David Rientjes
Date: Tuesday, November 9, 2010 - 4:33 pm

If they don't care, then they won't be using oom_adj, so you're point 
about it's deprecation is irrelevant.

Other users do want a more powerful userspace interface with a unit and 
higher resolution (I am one of them), there's no requirement that those 

Memcg also has a command-line disabling option to avoid incurring this 1% 

oom_adj isn't removed, it's deprecated.  These users are using a 
deprecated interface and have a few years to convert to using the new 

I don't understand what you're trying to say here, but the current users 
of oom_adj that aren't +15 or -16 (or OOM_DISABLE) are arbitrary based 
relative to other tasks such as +5, +10, etc.  They don't have any 
semantics other than being arbitrarily relative because it doesn't work in 

That's incorrect, I didn't break anything by deprecating a tunable for a 
few years.  oom_adj gets converted roughly into an equivalent (but linear) 
oom_score_adj.

Unfortunately for your argument, you can't show a single example of a 
current oom_adj user that has a scientific calculation behind its value 

It's deprecated for a few years so users can gradually convert to the new 
tunable, it wasn't removed when the new one was introduced.  A higher 
resolution tunable that scales linearly with a unit is an advantage for 
Linux (for the minority of users who care about oom killing priority 
beyond the heuristic) and I think a few years is enough time for users to 
do a simple conversion to the new tunable.
--

From: Alan Cox
Date: Tuesday, November 9, 2010 - 4:35 pm

Documentation/ABI/obsolete/

should have all obsoletes in it.

Alan
--

From: David Rientjes
Date: Tuesday, November 9, 2010 - 4:48 pm

Good point, the only documentation right now is in 
Documentation/feature-removal-schedule.txt and in the kernel log the first 
time oom_adj is written.  I'll generate a patch, thanks!
--

From: David Rientjes
Date: Tuesday, November 9, 2010 - 4:55 pm

/proc/pid/oom_adj was deprecated in August 2010 with the introduction of
the new oom killer heuristic.

This patch copies the Documentation/feature-removal-schedule.txt entry
for this tunable to the Documentation/ABI/obsolete directory so nobody
misses it.

Reported-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/ABI/obsolete/proc-pid-oom_adj |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/obsolete/proc-pid-oom_adj

diff --git a/Documentation/ABI/obsolete/proc-pid-oom_adj b/Documentation/ABI/obsolete/proc-pid-oom_adj
new file mode 100644
--- /dev/null
+++ b/Documentation/ABI/obsolete/proc-pid-oom_adj
@@ -0,0 +1,22 @@
+What:	/proc/<pid>/oom_adj
+When:	August 2012
+Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
+	badness heuristic used to determine which task to kill when the kernel
+	is out of memory.
+
+	The badness heuristic has since been rewritten since the introduction of
+	this tunable such that its meaning is deprecated.  The value was
+	implemented as a bitshift on a score generated by the badness()
+	function that did not have any precise units of measure.  With the
+	rewrite, the score is given as a proportion of available memory to the
+	task allocating pages, so using a bitshift which grows the score
+	exponentially is, thus, impossible to tune with fine granularity.
+
+	A much more powerful interface, /proc/<pid>/oom_score_adj, was
+	introduced with the oom killer rewrite that allows users to increase or
+	decrease the badness() score linearly.  This interface will replace
+	/proc/<pid>/oom_adj.
+
+	A warning will be emitted to the kernel log if an application uses this
+	deprecated interface.  After it is printed once, future warnings will be
+	suppressed until the kernel is rebooted.
--

From: KOSAKI Motohiro
Date: Sunday, November 14, 2010 - 5:22 pm

NAK. You seems to think shouting claim makes some effect. but It's incorrect.

Incorrect. oom_adj and oom_score_adj have different concept and different abstraction.





--

From: David Rientjes
Date: Monday, November 15, 2010 - 3:38 am

The tunable is deprecated.  If you are really that concerned about the 
existing users who you don't think can convert in the next two years, why 
don't you help them convert?  That fixes the issue, but you're not 
interested in that.  I offered to convert any open-source users you can 
correct even when the kernel is obviously moving in a different direction.

Others may have a different opinion of who is being childish in this whole 
ordeal.
--

From: KOSAKI Motohiro
Date: Tuesday, November 23, 2010 - 12:16 am

Why don't you change by _your_ hand? 

_Usually_ userland software changed at first _by_ who wanted the change.
Example, we fujitsu changed elf core file format when vma are >65536, but
It was not made any breakage. we changed gdb, binutils, elfutils and etc etc



--

From: KOSAKI Motohiro
Date: Saturday, November 13, 2010 - 10:07 pm

No irrelevant. Your patch break their environment even though




Even if you don't understand, they are IN THE WORLD. you don't have to


no sense.



--

From: David Rientjes
Date: Sunday, November 14, 2010 - 2:39 pm

The _only_ difference too oom_adj since the rewrite is that it is now 
mapped on a linear scale rather than an exponential scale.  That's because 
the heuristic itself has a defined range [0, 1000] that characterizes the 
memory usage of the application it is ranking.  To show any breakge, you 
would have to show how oom_adj values being used by applications are based 
on a calculated value that prioritizes those tasks amongst each other.  
With the exponential scale, that's nearly impossible because of the number 
of arbitrary heuristics that were used before oom_adj were considered 
(runtime, nice level, CAP_SYS_RAWIO, etc).

So don't talk about userspace breakage when you can't even describe it or 
present a single usecase.
--

From: KOSAKI Motohiro
Date: Tuesday, November 23, 2010 - 12:16 am

But, No people have agreed your powerfulness even though you talked about
the same explanation a lot of times.

Again, IF you need to [0 .. 1000] range, you can calculate it by your
application. current oom score can be get from /proc/pid/oom_score and
total memory can be get from /proc/meminfo. You shouldn't have break

Huh? Remember! your feature have ZERO user.



--

From: David Rientjes
Date: Saturday, November 27, 2010 - 6:41 pm

Because NOTHING breaks with the new mapping.  Eight months later since 
this was initially proposed on linux-mm, you still cannot show a single 
example that depended on the exponential mapping of oom_adj.  I'm not 
going to continue responding to your criticism about this point since your 

That would require the userspace tunable to be adjusted anytime a task's 
mempolicy changes, its nodemask changes, it's cpuset attachment changes, 
its mems change, a memcg limit changes, etc.  The only constant is the 
task's priority, and the current oom_score_adj implementation preserves 
that unless explicitly changed later by the user.  I completely understand 
that you may not have a use for this.
--

From: KOSAKI Motohiro
Date: Tuesday, November 30, 2010 - 6:03 am

All situation can be calculated on userland. User process can be know



--

From: David Rientjes
Date: Tuesday, November 30, 2010 - 1:07 pm

Nothing breaks.  If something did, you could respond to my answer above 
and provide a single example of a real-world example that broke as a 

Yes, but the proportional priority-based oom_score_adj values allow users 
to avoid recalculating and writing that value anytime a mempolicy 
attachment changes, its nodemask changes, it moves to another cpuset, its 
set of mems changes, its memcg attachment changes, its limit is modiifed, 
etc.
--

From: KOSAKI Motohiro
Date: Sunday, October 24, 2010 - 8:28 pm

Changelog
  o since v1
    - function comment also change current->cred_guard_mutex to
      current->signal->cred_guard_mutex.

---------------------------------------------------------------------------
Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent
execve() has no worth.

Let's move ->cred_guard_mutex from task_struct to signal_struct. It
naturally prevent multiple-threads-inside-exec.

Cc: stable@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/exec.c                 |   10 +++++-----
 fs/proc/base.c            |    8 ++++----
 include/linux/init_task.h |    4 ++--
 include/linux/sched.h     |    7 ++++---
 include/linux/tracehook.h |    2 +-
 kernel/cred.c             |    4 +---
 kernel/fork.c             |    2 ++
 kernel/ptrace.c           |    4 ++--
 8 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 6d2b6f9..94dabd2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1078,14 +1078,14 @@ EXPORT_SYMBOL(setup_new_exec);
  */
 int prepare_bprm_creds(struct linux_binprm *bprm)
 {
-	if (mutex_lock_interruptible(&current->cred_guard_mutex))
+	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
 	bprm->cred = prepare_exec_creds();
 	if (likely(bprm->cred))
 		return 0;
 
-	mutex_unlock(&current->cred_guard_mutex);
+	mutex_unlock(&current->signal->cred_guard_mutex);
 	return -ENOMEM;
 }
 
@@ -1093,7 +1093,7 @@ void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
-		mutex_unlock(&current->cred_guard_mutex);
+		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
 	kfree(bprm);
@@ -1114,13 +1114,13 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be ...
From: Roland McGrath
Date: Monday, October 25, 2010 - 10:26 am

This has my ACK if Oleg doesn't see any problems.

Thanks,
Roland
--

From: Oleg Nesterov
Date: Monday, October 25, 2010 - 10:42 am

I believe the patch is fine (it already has my reviewed-by).

Except: I am not sure about -stable. At least, this patch should
not go into the <2.6.35 kernels, it relies on misc changes which
changed the scope of task->signal. Before 2.6.35 almost any user
of ->cred_guard_mutex can race with exit and hit ->signal == NULL.

Oleg.

--

From: Roland McGrath
Date: Monday, October 25, 2010 - 10:51 am

I see no justification for a change like this in any -stable tree.  It's
just a cleanup, right?  If it's a prerequisite for the fix we like for an
"important" bug, then that's a different story.  In its own right, it's
clearly not appropriate for backporting.


Thanks,
Roland
--

From: KOSAKI Motohiro
Date: Tuesday, October 26, 2010 - 6:04 am

Because [4/4] depend on [3/4] and I hope to backport it. Do you dislike it
too?


Thanks.



--

From: Roland McGrath
Date: Tuesday, October 26, 2010 - 6:18 am

Ah, OK.  That is indeed a fix for an important bug.  Not knowing the mm
code very well, I'm not in a position to judge whether it's safe enough
for a -stable stream or not.  If it is and it could be done safely
without relying on 3/4, that would seem safer to me, but it is not a
strong opinion.


Thanks,
Roland
--

From: KOSAKI Motohiro
Date: Sunday, October 24, 2010 - 8:29 pm

ChangeLog
 o since v2
   - Move ->in_exec_mm from task_struct to signal_struct
   - clean up oom_rss_swap_usage()
 o since v1
   - Always use thread group leader's ->in_exec_mm.
     It slightly makes efficient oom when a process has many thread.
   - Add the link of Brad's explanation to the description.


-----------------------------------------------------------
Brad Spengler published a local memory-allocation DoS that
evades the OOM-killer (though not the virtual memory RLIMIT):
http://www.grsecurity.net/~spender/64bit_dos.c

Because execve() makes new mm struct and setup stack and
copy argv. It mean the task have two mm while execve() temporary.
Unfortunately this nascent mm is not pointed any tasks, then
OOM-killer can't detect this memory usage. therefore OOM-killer
may kill incorrect task.

Thus, this patch added signal->in_exec_mm member and track
nascent mm usage.

Cc: stable@kernel.org
Cc: pageexec@freemail.hu
Cc: Roland McGrath <roland@redhat.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Eugene Teo <eteo@redhat.com>
Reported-by: Brad Spengler <spender@grsecurity.net>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/compat.c             |    4 +++-
 fs/exec.c               |   16 +++++++++++++++-
 include/linux/binfmts.h |    1 +
 include/linux/sched.h   |    1 +
 mm/oom_kill.c           |   26 +++++++++++++++++++-------
 5 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 0644a15..a85b196 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1567,8 +1567,10 @@ int compat_do_execve(char * filename,
 	return retval;
 
 out:
-	if (bprm->mm)
+	if (bprm->mm) {
+		set_exec_mm(NULL);
 		mmput(bprm->mm);
+	}
 
 out_file:
 	if (bprm->file) {
diff --git a/fs/exec.c b/fs/exec.c
index 94dabd2..2395d10 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -347,6 +347,8 @@ int bprm_mm_init(struct linux_binprm *bprm)
 	if (err)
 		goto err;
 
+	set_exec_mm(mm);
+
 	return 0;
 
 err:
@@ -759,6 ...
From: pageexec
Date: Monday, October 25, 2010 - 4:28 am

On 25 Oct 2010 at 12:29, KOSAKI Motohiro wrote:

hi,


what happens when two (or more) threads in the same process call execve? the
above set_exec_mm calls will race (de_thread doesn't happen until much later
in execve) and overwrite each other's ->in_exec_mm which will still lead to
problems since there will be at most one temporary mm accounted for in the
oom killer.

[update: since i don't seem to have been cc'd on the other patch that
serializes execve, the above point is moot ;)]

worse, even if each temporary mm was tracked separately there'd still be a
race where the oom killer can get triggered with the culprit thread long
gone (and reset ->in_exec_mm) and never to be found, so the oom killer would
find someone else as guilty.

now all this leads me to suggest a simpler solution, at least for the first
problem mentioned above (i don't know what to do with the second one yet as
it seems to be a generic issue with the oom killer, probably it should verify
the oom situation once again after it took the task_list lock).

[update: while the serialized execve solves the first problem, i still think
that my idea is simpler and worth considering, so i leave it here even if for
just documentation purposes ;)]

given that all the oom killer needs from the mm struct is either ->total_pages
(in .35 and before, so be careful with the stable backport) or some ->rss_stat
counters, wouldn't it be much easier to simply transfer the bprm->mm counters
into current->mm for the duration of the execve (say, add them in get_arg_page
and remove them when bprm->mm is mmput in the do_execve failure path, etc)? the
transfer can be either to the existing counters or to new ones (obviously in
the latter case the oom code needs a small change to take the new counters into
account as well).

cheers,

 PaX Team

--

From: KOSAKI Motohiro
Date: Tuesday, October 26, 2010 - 12:25 am

Hi


patch 3/4 prevent this race :)
now, 3/4 move cred_guard_mutex into signal struct. and execve() take 

Ah, sorry. that's my mistake. I thought you've reviewed this one at
my last posting. 

can you please see 3/4? the URL is below.



As I said at previous discussion, It is possible and one of option. and I've
made the patch of this way too at once. But, It is messy than current. because
pages in nascent mm are also swappable. then, a swapping-out of such page need
to update both mm->rss_stat and nascent_mm->rss_stat. IOW, we need to change 
VM core. But, actually, execve vs OOM race is very rarely event, then, I don't 
hope to add some new branch and complexity.

Note: before 2.6.35, oom_kill.c track amount of process virtual address space.
then changing get_arg_page() is enough. but on 2.6.36 or later, oom_kill.c track
amount of process rss. then we can't ignore swap in/out event. and changing
get_arg_page() is not enough. Or, Do you propse new OOM account 
mm->rss + nascent_mm->total_vm? this can be easily. but tricky more.

So, I think this is one of trade-off issue. If you have better patch rather
than me, I'm glad to accept your one and join to review it. However myself 
don't plan to take this approach.


Thanks.



--

From: Oleg Nesterov
Date: Tuesday, November 23, 2010 - 7:34 am

Stupid question.

Can't we just account these allocations in the old -mm temporary?

IOW. Please look at the "patch" below. It is of course incomplete
and wrong (to the point inc_mm_counter() is not safe without
SPLIT_RSS_COUNTING), and copy_strings/flush_old_exec are not the
best places to play with mm-counters, just to explain what I mean.

It is very simple. copy_strings() increments MM_ANONPAGES every
time we add a new page into bprm->vma. This makes this memory
visible to select_bad_process().

When exec changes ->mm (or if it fails), we change MM_ANONPAGES
counter back.

Most probably I missed something, but what do you think?

Oleg.

--- x/include/linux/binfmts.h
+++ x/include/linux/binfmts.h
@@ -29,6 +29,7 @@ struct linux_binprm{
 	char buf[BINPRM_BUF_SIZE];
 #ifdef CONFIG_MMU
 	struct vm_area_struct *vma;
+	unsigned long mm_anonpages;
 #else
 # define MAX_ARG_PAGES	32
 	struct page *page[MAX_ARG_PAGES];
--- x/fs/exec.c
+++ x/fs/exec.c
@@ -457,6 +457,9 @@ static int copy_strings(int argc, const 
 					goto out;
 				}
 
+				bmrp->mm_anonpages--;
+				inc_mm_counter(current->mm, MM_ANONPAGES);
+
 				if (kmapped_page) {
 					flush_kernel_dcache_page(kmapped_page);
 					kunmap(kmapped_page);
@@ -1003,6 +1006,7 @@ int flush_old_exec(struct linux_binprm *
 	/*
 	 * Release all of the old mmap stuff
 	 */
+	add_mm_counter(current->mm, bprm->mm_anonpages);
 	retval = exec_mmap(bprm->mm);
 	if (retval)
 		goto out;
@@ -1426,8 +1430,10 @@ int do_execve(const char * filename,
 	return retval;
 
 out:
-	if (bprm->mm)
-		mmput (bprm->mm);
+	if (bprm->mm) {
+		add_mm_counter(current->mm, bprm->mm_anonpages);
+		mmput(bprm->mm);
+	}
 
 out_file:
 	if (bprm->file) {

--

From: KOSAKI Motohiro
Date: Tuesday, November 23, 2010 - 5:24 pm

Because, If the pages of argv is swapping out when processing execve,
This accouing doesn't work.

Of cource, changing swapping-out logic is one of way. But I did hope
no VM core logic change. taking implict mlocking argv area during execve
is also one of option. But I did think implicit mlocking is more risky.

Is this enough explanation? Please don't hesitate say "no". If people
don't like my approach, I don't hesitate change my thinking.

Thanks.


--

From: Oleg Nesterov
Date: Wednesday, November 24, 2010 - 4:09 am

Why?

If copy_strings() inserts the new page into bprm->vma and then
this page is swapped out, inc_mm_counter(current->mm, MM_ANONPAGES)
becomes incorrect, yes. And we can't turn it into MM_SWAPENTS.

But does this really matter? oom_badness() counts MM_ANONPAGES +

Well, certainly I can't say no ;)

But it would be nice to find a more simple fix (if it can work,
of course).


And. I need a simple solution for the older kernels.

Oleg.

--

From: KOSAKI Motohiro
Date: Thursday, November 25, 2010 - 4:06 am

Ah, I got it. I did too strongly get stucked correct accounting. but
you mean it's not must.


Alright. It is certinally considerable one.


--

From: Oleg Nesterov
Date: Thursday, November 25, 2010 - 7:02 am

Yes. In fact, I _think_ this patch makes accounting better, even if
the extra MM_ANONPAGES numbers are not 100% correct.

Even if we add signal->in_exec_mm, nobody except oom_badness() will
look at it.

With this patch, say, /proc/pid/statm or /proc/pid/status will report
the memory allocated by the execing task. Even if technically this is
not correct (and 'swap' part may be wrong), this makes sense imho.
Otherwise, there is no way to see that this task allocates (may be
a lot) of memory.


Great! I'll send the patch tomorrow.

Even if you prefer another fix for 2.6.37/stable, I'd like to see
your review to know if it is correct or not (for backporting).

Oleg.

--

From: Oleg Nesterov
Date: Thursday, November 25, 2010 - 12:36 pm

OK, what do you think about the patch below?

Seems to work, with this patch the test-case doesn't kill the
system (sysctl_oom_kill_allocating_task == 0).

I didn't dare to change !CONFIG_MMU case, I do not know how to
test it.

The patch is not complete, compat_copy_strings() needs changes.
But, shouldn't it use get_arg_page() too? Otherwise, where do
we check RLIMIT_STACK?

The patch asks for the cleanups. In particular, I think exec_mmap()
should accept bprm, not mm. But I'd prefer to do this later.

Oleg.

 include/linux/binfmts.h |    1 +
 fs/exec.c               |   28 ++++++++++++++++++++++++++--
 2 files changed, 27 insertions(+), 2 deletions(-)

--- K/include/linux/binfmts.h~acct_exec_mem	2010-08-19 11:35:00.000000000 +0200
+++ K/include/linux/binfmts.h	2010-11-25 20:19:33.000000000 +0100
@@ -33,6 +33,7 @@ struct linux_binprm{
 # define MAX_ARG_PAGES	32
 	struct page *page[MAX_ARG_PAGES];
 #endif
+	unsigned long vma_pages;
 	struct mm_struct *mm;
 	unsigned long p; /* current top of mem */
 	unsigned int
--- K/fs/exec.c~acct_exec_mem	2010-11-25 15:16:56.000000000 +0100
+++ K/fs/exec.c	2010-11-25 20:20:49.000000000 +0100
@@ -162,6 +162,25 @@ out:
   	return error;
 }
 
+static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
+{
+	struct mm_struct *mm = current->mm;
+	long diff = pages - bprm->vma_pages;
+
+	if (!mm || !diff)
+		return;
+
+	bprm->vma_pages += diff;
+
+#ifdef SPLIT_RSS_COUNTING
+	add_mm_counter(mm, MM_ANONPAGES, diff);
+#else
+	spin_lock(&mm->page_table_lock);
+	add_mm_counter(mm, MM_ANONPAGES, diff);
+	spin_unlock(&mm->page_table_lock);
+#endif
+}
+
 #ifdef CONFIG_MMU
 
 static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
@@ -186,6 +205,8 @@ static struct page *get_arg_page(struct 
 		unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start;
 		struct rlimit *rlim;
 
+		acct_arg_size(bprm, size / PAGE_SIZE);
+
 		/*
 		 * We've historically supported up to 32 pages ...
From: KOSAKI Motohiro
Date: Sunday, November 28, 2010 - 10:25 pm

Because NOMMU doesn't have variable length argv. Instead it is still
using MAX_ARG_STRLEN as old MMU code.

32 pages hard coded argv limitation naturally prevent this nascent mm





Why do we need this unacct here? I mean 1) if exec_mmap() is success,
we don't need unaccount at all 2) if exec_mmap() is failure, an epilogue of



--

From: Oleg Nesterov
Date: Monday, November 29, 2010 - 4:33 am

Ah, I didn't mean NOMMU. I meant compat_execve()->compat_copy_strings().



Well it does, to revert the MM_ANONPAGES counter. I'll add the empty


Yes, we already killed all sub-threads. But this doesn't mean nobody
else can use current->mm, think about CLONE_VM. The simplest example

Yes.

Thanks Kosaki!

I'll resend v2 today. I am still not sure about compat_copy_strings()...

Oleg.

--

From: Oleg Nesterov
Date: Monday, November 29, 2010 - 11:23 am

OK, please see below, just for your review.


Yes, I think it needs the same checks. It should use get_arg_page()
or we need more copy-and-paste code, I think it should also check
fatal_signal_pending() like copy_strings() does.

I was going to export get_arg_page/acct_arg_size, but it is so
ugly. I'll try to find the way to unify copy_strings and
compat_copy_strings, not sure it is possible to do cleanly.

Probably this needs a separate patch in any case.

Oleg.

Changes:

	- move acct_arg_size() under CONFIG_MMU

	- add the "nop" version for NOMMMU

 include/linux/binfmts.h |    1 +
 fs/exec.c               |   32 ++++++++++++++++++++++++++++++--
 2 files changed, 31 insertions(+), 2 deletions(-)

--- K/include/linux/binfmts.h~acct_exec_mem	2010-08-19 11:35:00.000000000 +0200
+++ K/include/linux/binfmts.h	2010-11-29 17:29:35.000000000 +0100
@@ -29,6 +29,7 @@ struct linux_binprm{
 	char buf[BINPRM_BUF_SIZE];
 #ifdef CONFIG_MMU
 	struct vm_area_struct *vma;
+	unsigned long vma_pages;
 #else
 # define MAX_ARG_PAGES	32
 	struct page *page[MAX_ARG_PAGES];
--- K/fs/exec.c~acct_exec_mem	2010-11-25 15:16:56.000000000 +0100
+++ K/fs/exec.c	2010-11-29 17:51:43.000000000 +0100
@@ -164,6 +164,25 @@ out:
 
 #ifdef CONFIG_MMU
 
+static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
+{
+	struct mm_struct *mm = current->mm;
+	long diff = (long)(pages - bprm->vma_pages);
+
+	if (!mm || !diff)
+		return;
+
+	bprm->vma_pages = pages;
+
+#ifdef SPLIT_RSS_COUNTING
+	add_mm_counter(mm, MM_ANONPAGES, diff);
+#else
+	spin_lock(&mm->page_table_lock);
+	add_mm_counter(mm, MM_ANONPAGES, diff);
+	spin_unlock(&mm->page_table_lock);
+#endif
+}
+
 static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 		int write)
 {
@@ -186,6 +205,8 @@ static struct page *get_arg_page(struct 
 		unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start;
 		struct rlimit *rlim;
 
+		acct_arg_size(bprm, size / PAGE_SIZE);
+
 		/*
 		 * ...
From: Oleg Nesterov
Date: Tuesday, November 30, 2010 - 12:54 pm

But I think this is the only option for 2.6.37/stable.

So. I am sending 2 patches, hopefully they fix the problems

I'll send the cleanups which unify compat/non-compat code on
top of these fixes, this is not stable material.

Oleg.

--

From: Oleg Nesterov
Date: Tuesday, November 30, 2010 - 12:55 pm

Brad Spengler published a local memory-allocation DoS that
evades the OOM-killer (though not the virtual memory RLIMIT):
http://www.grsecurity.net/~spender/64bit_dos.c

execve()->copy_strings() can allocate a lot of memory, but
this is not visible to oom-killer, nobody can see the nascent
bprm->mm and take it into account.

With this patch get_arg_page() increments current's MM_ANONPAGES
counter every time we allocate the new page for argv/envp. When
do_execve() succeds or fails, we change this counter back.

Technically this is not 100% correct, we can't know if the new
page is swapped out and turn MM_ANONPAGES into MM_SWAPENTS, but
I don't think this really matters and everything becomes correct
once exec changes ->mm or fails.

Reported-by: Brad Spengler <spender@grsecurity.net>
By-discussion-with: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 include/linux/binfmts.h |    1 +
 fs/exec.c               |   32 ++++++++++++++++++++++++++++++--
 2 files changed, 31 insertions(+), 2 deletions(-)

--- K/include/linux/binfmts.h~acct_exec_mem	2010-11-30 18:27:15.000000000 +0100
+++ K/include/linux/binfmts.h	2010-11-30 18:28:54.000000000 +0100
@@ -29,6 +29,7 @@ struct linux_binprm{
 	char buf[BINPRM_BUF_SIZE];
 #ifdef CONFIG_MMU
 	struct vm_area_struct *vma;
+	unsigned long vma_pages;
 #else
 # define MAX_ARG_PAGES	32
 	struct page *page[MAX_ARG_PAGES];
--- K/fs/exec.c~acct_exec_mem	2010-11-30 18:27:15.000000000 +0100
+++ K/fs/exec.c	2010-11-30 18:28:54.000000000 +0100
@@ -164,6 +164,25 @@ out:
 
 #ifdef CONFIG_MMU
 
+static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
+{
+	struct mm_struct *mm = current->mm;
+	long diff = (long)(pages - bprm->vma_pages);
+
+	if (!mm || !diff)
+		return;
+
+	bprm->vma_pages = pages;
+
+#ifdef SPLIT_RSS_COUNTING
+	add_mm_counter(mm, MM_ANONPAGES, diff);
+#else
+	spin_lock(&mm->page_table_lock);
+	add_mm_counter(mm, MM_ANONPAGES, ...
From: KOSAKI Motohiro
Date: Tuesday, November 30, 2010 - 5:12 pm

Looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



One minor request.

I guess this function can easily makes confusing to a code reader. So I
hope you write small function comments. describe to
 - What is oom nascent issue




--

From: Oleg Nesterov
Date: Wednesday, December 1, 2010 - 11:07 am

Agreed, this needs a comment.

The patch was already applied, I'll send a separate one on top
of the next "unify exec/compat" series. Or, I'll add the comments
into this series, depending on review.

Thanks,

Oleg.

--

From: Oleg Nesterov
Date: Tuesday, November 30, 2010 - 12:56 pm

Note: this patch targets 2.6.37 and tries to be as simple as possible.
That is why it adds more copy-and-paste horror into fs/compat.c and
uglifies fs/exec.c, this will be cleanuped later.

compat_copy_strings() plays with bprm->vma/mm directly and thus has
two problems: it lacks the RLIMIT_STACK check and argv/envp memory
is not visible to oom killer.

Export acct_arg_size() and get_arg_page(), change compat_copy_strings()
to use get_arg_page(), change compat_do_execve() to do acct_arg_size(0)
as do_execve() does.

Add the fatal_signal_pending/cond_resched checks into compat_count() and
compat_copy_strings(), this matches the code in fs/exec.c and certainly
makes sense.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 include/linux/binfmts.h |    4 ++++
 fs/exec.c               |    8 ++++----
 fs/compat.c             |   28 +++++++++++++++-------------
 3 files changed, 23 insertions(+), 17 deletions(-)

--- K/include/linux/binfmts.h~compat_get_arg_page	2010-11-30 18:28:54.000000000 +0100
+++ K/include/linux/binfmts.h	2010-11-30 18:30:45.000000000 +0100
@@ -60,6 +60,10 @@ struct linux_binprm{
 	unsigned long loader, exec;
 };
 
+extern void acct_arg_size(struct linux_binprm *bprm, unsigned long pages);
+extern struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
+					int write);
+
 #define BINPRM_FLAGS_ENFORCE_NONDUMP_BIT 0
 #define BINPRM_FLAGS_ENFORCE_NONDUMP (1 << BINPRM_FLAGS_ENFORCE_NONDUMP_BIT)
 
--- K/fs/exec.c~compat_get_arg_page	2010-11-30 18:28:54.000000000 +0100
+++ K/fs/exec.c	2010-11-30 18:30:45.000000000 +0100
@@ -164,7 +164,7 @@ out:
 
 #ifdef CONFIG_MMU
 
-static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
+void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
 {
 	struct mm_struct *mm = current->mm;
 	long diff = (long)(pages - bprm->vma_pages);
@@ -183,7 +183,7 @@ static void acct_arg_size(struct linux_b
 #endif
 }
 
-static struct page *get_arg_page(struct linux_binprm *bprm, ...
From: KOSAKI Motohiro
Date: Tuesday, November 30, 2010 - 8:04 pm

Looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



--

From: Oleg Nesterov
Date: Tuesday, November 30, 2010 - 1:00 pm

(remove stable)


On top of

	[PATCH 1/2] exec: make argv/envp memory visible to oom-killer
	[PATCH 2/2] exec: copy-and-paste the fixes into compat_do_execve() paths

Imho, execve code in fs/compat.c must die. It is very hard to
maintain this copy-and-paste horror.

Oleg.

--

From: Oleg Nesterov
Date: Tuesday, November 30, 2010 - 1:01 pm

No functional changes, preparation to simplify the review.

And the new (and currently unused) "bool compat" argument to
get_arg_ptr(), count(), and copy_strings().

Add this argument to do_execve() as well, and rename it to
do_execve_common().

Reintroduce do_execve() as a trivial wrapper() on top of
do_execve_common(compat => false).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 fs/exec.c |   33 +++++++++++++++++++++------------
 1 file changed, 21 insertions(+), 12 deletions(-)

--- K/fs/exec.c~2_is_compat_arg	2010-11-30 19:14:54.000000000 +0100
+++ K/fs/exec.c	2010-11-30 19:47:24.000000000 +0100
@@ -391,7 +391,7 @@ err:
 }
 
 static const char __user *
-get_arg_ptr(const char __user * const __user *argv, int argc)
+get_arg_ptr(const char __user * const __user *argv, int argc, bool compat)
 {
 	const char __user *ptr;
 
@@ -404,13 +404,13 @@ get_arg_ptr(const char __user * const __
 /*
  * count() counts the number of strings in array ARGV.
  */
-static int count(const char __user * const __user * argv, int max)
+static int count(const char __user * const __user *argv, int max, bool compat)
 {
 	int i = 0;
 
 	if (argv != NULL) {
 		for (;;) {
-			const char __user *p = get_arg_ptr(argv, i);
+			const char __user *p = get_arg_ptr(argv, i, compat);
 
 			if (!p)
 				break;
@@ -435,7 +435,7 @@ static int count(const char __user * con
  * ensures the destination page is created and not swapped out.
  */
 static int copy_strings(int argc, const char __user *const __user *argv,
-			struct linux_binprm *bprm)
+			struct linux_binprm *bprm, bool compat)
 {
 	struct page *kmapped_page = NULL;
 	char *kaddr = NULL;
@@ -448,7 +448,7 @@ static int copy_strings(int argc, const 
 		unsigned long pos;
 
 		ret = -EFAULT;
-		str = get_arg_ptr(argv, argc);
+		str = get_arg_ptr(argv, argc, compat);
 		if (IS_ERR(str))
 			goto out;
 
@@ -531,7 +531,8 @@ int copy_strings_kernel(int argc, const 
 	int r;
 	mm_segment_t oldfs = get_fs();
 ...
From: Oleg Nesterov
Date: Tuesday, November 30, 2010 - 1:00 pm

Introduce get_arg_ptr() helper, convert count() and copy_strings()
to use it.

No functional changes, preparation. This helper is trivial, it just
reads the pointer from argv/envp user-space array.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 fs/exec.c |   36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

--- K/fs/exec.c~1_get_arg_ptr	2010-11-30 18:30:45.000000000 +0100
+++ K/fs/exec.c	2010-11-30 19:14:54.000000000 +0100
@@ -390,6 +390,17 @@ err:
 	return err;
 }
 
+static const char __user *
+get_arg_ptr(const char __user * const __user *argv, int argc)
+{
+	const char __user *ptr;
+
+	if (get_user(ptr, argv + argc))
+		return ERR_PTR(-EFAULT);
+
+	return ptr;
+}
+
 /*
  * count() counts the number of strings in array ARGV.
  */
@@ -399,13 +410,14 @@ static int count(const char __user * con
 
 	if (argv != NULL) {
 		for (;;) {
-			const char __user * p;
+			const char __user *p = get_arg_ptr(argv, i);
 
-			if (get_user(p, argv))
-				return -EFAULT;
 			if (!p)
 				break;
-			argv++;
+
+			if (IS_ERR(p))
+				return -EFAULT;
+
 			if (i++ >= max)
 				return -E2BIG;
 
@@ -435,16 +447,18 @@ static int copy_strings(int argc, const 
 		int len;
 		unsigned long pos;
 
-		if (get_user(str, argv+argc) ||
-				!(len = strnlen_user(str, MAX_ARG_STRLEN))) {
-			ret = -EFAULT;
+		ret = -EFAULT;
+		str = get_arg_ptr(argv, argc);
+		if (IS_ERR(str))
 			goto out;
-		}
 
-		if (!valid_arg_len(bprm, len)) {
-			ret = -E2BIG;
+		len = strnlen_user(str, MAX_ARG_STRLEN);
+		if (!len)
+			goto out;
+
+		ret = -E2BIG;
+		if (!valid_arg_len(bprm, len))
 			goto out;
-		}
 
 		/* We're going to work our way backwords. */
 		pos = bprm->p;

--

From: Oleg Nesterov
Date: Tuesday, November 30, 2010 - 1:01 pm

Teach get_arg_ptr() to handle compat = T case correctly.

This allows us to remove the compat_do_execve() code from fs/compat.c
and reimplement compat_do_execve() as the trivial wrapper on top of
do_execve_common(compat => true).

In fact, this fixes another (minor) bug. "compat_uptr_t str" can
overflow after "str += len" in compat_copy_strings() if a 64bit
application execs via sys32_execve().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 fs/exec.c   |   25 ++++++
 fs/compat.c |  235 ------------------------------------------------------------
 2 files changed, 25 insertions(+), 235 deletions(-)

--- K/fs/exec.c~3_use_compat	2010-11-30 19:47:24.000000000 +0100
+++ K/fs/exec.c	2010-11-30 20:15:11.000000000 +0100
@@ -55,6 +55,7 @@
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
 #include <linux/oom.h>
+#include <linux/compat.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -395,6 +396,18 @@ get_arg_ptr(const char __user * const __
 {
 	const char __user *ptr;
 
+#ifdef CONFIG_COMPAT
+	if (unlikely(compat)) {
+		compat_uptr_t __user *a = (void __user*)argv;
+		compat_uptr_t p;
+
+		if (get_user(p, a + argc))
+			return ERR_PTR(-EFAULT);
+
+		return compat_ptr(p);
+	}
+#endif
+
 	if (get_user(ptr, argv + argc))
 		return ERR_PTR(-EFAULT);
 
@@ -1501,6 +1514,18 @@ int do_execve(const char *filename,
 	return do_execve_common(filename, argv, envp, regs, false);
 }
 
+#ifdef CONFIG_COMPAT
+int compat_do_execve(char * filename,
+	compat_uptr_t __user *argv,
+	compat_uptr_t __user *envp,
+	struct pt_regs * regs)
+{
+	return do_execve_common(filename,
+				(void __user*)argv, (void __user*)envp,
+				regs, true);
+}
+#endif
+
 void set_binfmt(struct linux_binfmt *new)
 {
 	struct mm_struct *mm = current->mm;
--- K/fs/compat.c~3_use_compat	2010-11-30 18:30:45.000000000 +0100
+++ K/fs/compat.c	2010-11-30 20:17:28.000000000 +0100
@@ -1330,241 +1330,6 @@ compat_sys_openat(unsigned int dfd, cons
 	return do_sys_open(dfd, ...
From: Milton Miller
Date: Wednesday, December 1, 2010 - 10:37 am

This should not be marked unlikely.  Unlikely tells gcc the path
with over 99% confidence and disables branch predictors on some
architectures.  If called from a compat processes this will result
in a mispredicted branch every iteration.  Just use if (compat)


Thanks,
milton
--

From: Oleg Nesterov
Date: Wednesday, December 1, 2010 - 11:27 am

This applies to almost every likely/unlikely, and I think that compat
processes should fall into "unlikely category". But I don't really mind,

I'll recheck, but I don't think so. Please note that compat_ptr()
accepts "compat_uptr_t", not "compat_uptr_t *".

argv should be correct as a pointer to user-space, otherwise the
current code is buggy. For example, compat_do_execve() passes
argv to compat_count() which does get_user(argv) without any
conversion.

IOW, even if this should be fixed, I think this have nothing to
do with this patch. But I'll recheck, thanks.

Oleg.

--

From: Oleg Nesterov
Date: Tuesday, November 30, 2010 - 1:01 pm

Unexport acct_arg_size() and get_arg_page(), fs/compat.c doesn't
need them any longer.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 include/linux/binfmts.h |    4 ----
 fs/exec.c               |    8 ++++----
 2 files changed, 4 insertions(+), 8 deletions(-)

--- K/include/linux/binfmts.h~4_unexport_arg_helpers	2010-11-30 18:30:45.000000000 +0100
+++ K/include/linux/binfmts.h	2010-11-30 20:38:13.000000000 +0100
@@ -60,10 +60,6 @@ struct linux_binprm{
 	unsigned long loader, exec;
 };
 
-extern void acct_arg_size(struct linux_binprm *bprm, unsigned long pages);
-extern struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
-					int write);
-
 #define BINPRM_FLAGS_ENFORCE_NONDUMP_BIT 0
 #define BINPRM_FLAGS_ENFORCE_NONDUMP (1 << BINPRM_FLAGS_ENFORCE_NONDUMP_BIT)
 
--- K/fs/exec.c~4_unexport_arg_helpers	2010-11-30 20:15:11.000000000 +0100
+++ K/fs/exec.c	2010-11-30 20:38:13.000000000 +0100
@@ -165,7 +165,7 @@ out:
 
 #ifdef CONFIG_MMU
 
-void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
+static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
 {
 	struct mm_struct *mm = current->mm;
 	long diff = (long)(pages - bprm->vma_pages);
@@ -184,7 +184,7 @@ void acct_arg_size(struct linux_binprm *
 #endif
 }
 
-struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
+static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 		int write)
 {
 	struct page *page;
@@ -298,11 +298,11 @@ static bool valid_arg_len(struct linux_b
 
 #else
 
-void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
+static inline void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
 {
 }
 
-struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
+static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 		int write)
 {
 	struct page *page;

--

From: KOSAKI Motohiro
Date: Tuesday, November 30, 2010 - 8:09 pm

I strongly like this series. (yes, I made fault to forgot to change compat.c
multiple times ;)

Unfortunatelly, this is a bit large and I have no time now. I expect I
can review this at this or next weekend.....
Hopefully, anyoneelse will review this and ignore me....



--

From: KOSAKI Motohiro
Date: Monday, November 29, 2010 - 5:06 pm

From: David Rientjes
Date: Monday, October 25, 2010 - 1:37 pm

NACK.

Same response as the previous three times this patch has been proposed:

	http://marc.info/?t=128461666500001
	http://marc.info/?t=128324705200002
	http://marc.info/?t=128272938200002
--

Previous thread: [resend][PATCH] mm: increase RECLAIM_DISTANCE to 30 by KOSAKI Motohiro on Sunday, October 24, 2010 - 8:24 pm. (2 messages)

Next thread: linux-next: Tree for October 25 by Stephen Rothwell on Sunday, October 24, 2010 - 8:58 pm. (14 messages)