Current oom_score_adj is completely broken because It is strongly bound google usecase and ignore other all. 1) Priority inversion As kamezawa-san pointed out, This break cgroup and lxr environment. He said, > Assume 2 proceses A, B which has oom_score_adj of 300 and 0 > And A uses 200M, B uses 1G of memory under 4G system > > Under the system. > A's socre = (200M *1000)/4G + 300 = 350 > B's score = (1G * 1000)/4G = 250. > > In the cpuset, it has 2G of memory. > A's score = (200M * 1000)/2G + 300 = 400 > B's socre = (1G * 1000)/2G = 500 > > This priority-inversion don't happen in current system. 2) Ratio base point don't works large machine oom_score_adj normalize oom-score to 0-1000 range. but if the machine has 1TB memory, 1 point (i.e. 0.1%) mean 1GB. this is no suitable for tuning parameter. As I said, proposional value oriented tuning parameter has scalability risk. 3) No reason to implement ABI breakage. old tuning parameter mean) oom-score = oom-base-score x 2^oom_adj new tuning parameter mean) oom-score = oom-base-score + oom_score_adj / (totalram + totalswap) but "oom_score_adj / (totalram + totalswap)" can be calculated in userland too. beucase both totalram and totalswap has been exporsed by /proc. So no reason to introduce funny new equation. 4) totalram based normalization assume flat memory model. example, the machine is assymmetric numa. fat node memory and thin node memory might have another wight value. In other word, totalram based priority is a one of policy. Fixed and workload depended policy shouldn't be embedded in kernel. probably. Then, this patch remove *UGLY* total_pages suck completely. Googler can calculate it at userland! Cc: stable@kernel.org Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- fs/proc/base.c | 35 +++---------- include/linux/oom.h | 16 +----- include/linux/sched.h | 2 +- mm/oom_kill.c | 142 ...
oom_adj is not only used for kernel knob, but also used for application interface. Then, adding new knob is no good reason to deprecate it. Don't do stupid! Also, after former patch, oom_score_adj can't be used for setting OOM_DISABLE. We need "echo -17 > /proc/<pid>/oom_adj" thing. This reverts commit 51b1bd2ace1595b72956224deda349efa880b693. Cc: stable@kernel.org Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- Documentation/feature-removal-schedule.txt | 25 ------------------------- Documentation/filesystems/proc.txt | 3 --- fs/proc/base.c | 8 -------- include/linux/oom.h | 3 --- 4 files changed, 0 insertions(+), 39 deletions(-) diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index 9961f15..1cba5b8 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -151,31 +151,6 @@ Who: Eric Biederman <ebiederm@xmission.com> --------------------------- -What: /proc/<pid>/oom_adj -When: August 2012 -Why: /proc/<pid>/oom_adj allows userspace to influence the oom killer's - badness heuristic used to determine which task to kill when the kernel - is out of memory. - - The badness heuristic has since been rewritten since the introduction of - this tunable such that its meaning is deprecated. The value was - implemented as a bitshift on a score generated by the badness() - function that did not have any precise units of measure. With the - rewrite, the score is given as a proportion of available memory to the - task allocating pages, so using a bitshift which grows the score - exponentially is, thus, impossible to tune with fine granularity. - - A much more powerful interface, /proc/<pid>/oom_score_adj, was - introduced with the oom killer rewrite that allows users to increase or - decrease the badness() score linearly. This interface will ...
NACK as a logical follow-up to my NACK for "oom: remove totalpage normalization from oom_badness()" --
Huh? I requested you show us justification. BUT YOU DIDNT. If you have any usecase, show us RIGHT NOW. Don't speaking sucking crap. If you don't want ignore you. you are busy than you. DONT STUPID. --
The new tunable added in 2.6.36, /proc/pid/oom_score_adj, is necessary for the units that the badness score now uses. We need a tunable with a much higher resolution than the oom_adj scale from -16 to +15, and one that scales linearly as opposed to exponentially. Since that tunable is much more powerful than the oom_adj implementation, which never made any real sense for defining oom killing priority for any purpose other than polarization, the old tunable is deprecated for two years. --
The reason that you ware NAKed was not to introduce new powerful feature. You haven't tested your patch at all. Distro's initram script are using oom_adj interface and latest kernel show pointless warnings "/proc/xx/oom_adj is deprecated, please use /proc/xx/oom_score_adj instead." at _every_ boot time. As I said, DON'T SEND UNTESTED PATCH! DON'T BREAK USERLAND CARELESSLY! --
Linux users who care about prioritizing tasks for oom kill with a tunable that (1) has a unit, (2) has a higher resolution, and (3) is linear and not exponential. Memcg doesn't solve this issue without incurring a 1% No, it doesn't, and you completely and utterly failed to show a single usecase that broke as a result of this because nobody can currently use oom_adj for anything other than polarization. Thus, there's no backwards Yes, I've tested it, and it deprecates the tunable as expected. A single warning message serves the purpose well: let users know one time without being overly verbose that the tunable is deprecated and give them sufficient time (2 years) to start using the new tunable. That's how deprecation is done. --
No. Majority user don't care. You only talk about your case. Don't ignore Look at a real. All major distributions has already turn on memcg. End user don't need No. I showed. 1) Google code search showed some application are using this feature. http://www.google.com/codesearch?as_q=oom_adj&btnG=Search+Code&hl=ja&as_pa... 2) Not body use oom_adj other than polarization even though there are a few. example, kde are using. http://www.google.com/codesearch/p?hl=ja#MPJuLvSvNYM/pub/kde/unstable/snapshots/kdelib... When you are talking polarization issue, you blind a real. Don't talk your dream. 3) udev are using this feature. It's one of major linux component and you broke. http://www.google.com/codesearch/p?hl=ja#KVTjzuVpblQ/pub/linux/utils/kernel/hotplug/ud... You don't have to break our userland. you can't rewrite or deprecate no sense. Why do their application need to rewrite for *YOU*? Okey, you will got benefit from your new knob. But NOBDOY use the new one. and People need to rewrite their application even though no benefit. Don't do selfish userland breakage! --
And you said you ignore bug even though you have seen it. It suck! --
At v2.6.36-rc1, oom-killer doesn't work at all because YOU BROKE. And I was working on fixing it. 2010-08-19 http://marc.info/?t=128223176900001&r=1&w=2 http://marc.info/?t=128221532700003&r=1&w=2 http://marc.info/?t=128221532500008&r=1&w=2 However, You submitted new crap before the fixing. 2010-08-15 http://marc.info/?t=128184669600001&r=1&w=2 If you tested mainline a bit, you could find the problem quickly. You should have fixed mainline kernel at first. Again, YOU HAVEN'T TESTED YOUR OWN PATCH AT ALL. --
This existed before my oom killer rewrite, it was only noticed because the Yes, tasklist_lock was dropped in a mismerge of my patches when posting Yes, if a task was racing between oom_kill_process() and oom_kill_task() and all threads had dropped its mm between calls then there was a NULL This isn't "crap", this is a necessary bit to ensure that tasks that share an ->mm with a task immune from kill aren't killed themselves since we can't free the memory. We came to the consensus that it would be better to count the tasks that are OOM_DISABLE in the mm_struct to avoid the Thanks for finding a couple fixes during the 2.6.36-rc1 when the rewrite was first merged, it's much appreciated! --
If they don't care, then they won't be using oom_adj, so you're point about it's deprecation is irrelevant. Other users do want a more powerful userspace interface with a unit and higher resolution (I am one of them), there's no requirement that those Memcg also has a command-line disabling option to avoid incurring this 1% oom_adj isn't removed, it's deprecated. These users are using a deprecated interface and have a few years to convert to using the new I don't understand what you're trying to say here, but the current users of oom_adj that aren't +15 or -16 (or OOM_DISABLE) are arbitrary based relative to other tasks such as +5, +10, etc. They don't have any semantics other than being arbitrarily relative because it doesn't work in That's incorrect, I didn't break anything by deprecating a tunable for a few years. oom_adj gets converted roughly into an equivalent (but linear) oom_score_adj. Unfortunately for your argument, you can't show a single example of a current oom_adj user that has a scientific calculation behind its value It's deprecated for a few years so users can gradually convert to the new tunable, it wasn't removed when the new one was introduced. A higher resolution tunable that scales linearly with a unit is an advantage for Linux (for the minority of users who care about oom killing priority beyond the heuristic) and I think a few years is enough time for users to do a simple conversion to the new tunable. --
Documentation/ABI/obsolete/ should have all obsoletes in it. Alan --
Good point, the only documentation right now is in Documentation/feature-removal-schedule.txt and in the kernel log the first time oom_adj is written. I'll generate a patch, thanks! --
/proc/pid/oom_adj was deprecated in August 2010 with the introduction of the new oom killer heuristic. This patch copies the Documentation/feature-removal-schedule.txt entry for this tunable to the Documentation/ABI/obsolete directory so nobody misses it. Reported-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: David Rientjes <rientjes@google.com> --- Documentation/ABI/obsolete/proc-pid-oom_adj | 22 ++++++++++++++++++++++ 1 files changed, 22 insertions(+), 0 deletions(-) create mode 100644 Documentation/ABI/obsolete/proc-pid-oom_adj diff --git a/Documentation/ABI/obsolete/proc-pid-oom_adj b/Documentation/ABI/obsolete/proc-pid-oom_adj new file mode 100644 --- /dev/null +++ b/Documentation/ABI/obsolete/proc-pid-oom_adj @@ -0,0 +1,22 @@ +What: /proc/<pid>/oom_adj +When: August 2012 +Why: /proc/<pid>/oom_adj allows userspace to influence the oom killer's + badness heuristic used to determine which task to kill when the kernel + is out of memory. + + The badness heuristic has since been rewritten since the introduction of + this tunable such that its meaning is deprecated. The value was + implemented as a bitshift on a score generated by the badness() + function that did not have any precise units of measure. With the + rewrite, the score is given as a proportion of available memory to the + task allocating pages, so using a bitshift which grows the score + exponentially is, thus, impossible to tune with fine granularity. + + A much more powerful interface, /proc/<pid>/oom_score_adj, was + introduced with the oom killer rewrite that allows users to increase or + decrease the badness() score linearly. This interface will replace + /proc/<pid>/oom_adj. + + A warning will be emitted to the kernel log if an application uses this + deprecated interface. After it is printed once, future warnings will be + suppressed until the kernel is rebooted. --
NAK. You seems to think shouting claim makes some effect. but It's incorrect. Incorrect. oom_adj and oom_score_adj have different concept and different abstraction. --
The tunable is deprecated. If you are really that concerned about the existing users who you don't think can convert in the next two years, why don't you help them convert? That fixes the issue, but you're not interested in that. I offered to convert any open-source users you can correct even when the kernel is obviously moving in a different direction. Others may have a different opinion of who is being childish in this whole ordeal. --
Why don't you change by _your_ hand? _Usually_ userland software changed at first _by_ who wanted the change. Example, we fujitsu changed elf core file format when vma are >65536, but It was not made any breakage. we changed gdb, binutils, elfutils and etc etc --
No irrelevant. Your patch break their environment even though Even if you don't understand, they are IN THE WORLD. you don't have to no sense. --
The _only_ difference too oom_adj since the rewrite is that it is now mapped on a linear scale rather than an exponential scale. That's because the heuristic itself has a defined range [0, 1000] that characterizes the memory usage of the application it is ranking. To show any breakge, you would have to show how oom_adj values being used by applications are based on a calculated value that prioritizes those tasks amongst each other. With the exponential scale, that's nearly impossible because of the number of arbitrary heuristics that were used before oom_adj were considered (runtime, nice level, CAP_SYS_RAWIO, etc). So don't talk about userspace breakage when you can't even describe it or present a single usecase. --
But, No people have agreed your powerfulness even though you talked about the same explanation a lot of times. Again, IF you need to [0 .. 1000] range, you can calculate it by your application. current oom score can be get from /proc/pid/oom_score and total memory can be get from /proc/meminfo. You shouldn't have break Huh? Remember! your feature have ZERO user. --
Because NOTHING breaks with the new mapping. Eight months later since this was initially proposed on linux-mm, you still cannot show a single example that depended on the exponential mapping of oom_adj. I'm not going to continue responding to your criticism about this point since your That would require the userspace tunable to be adjusted anytime a task's mempolicy changes, its nodemask changes, it's cpuset attachment changes, its mems change, a memcg limit changes, etc. The only constant is the task's priority, and the current oom_score_adj implementation preserves that unless explicitly changed later by the user. I completely understand that you may not have a use for this. --
All situation can be calculated on userland. User process can be know --
Nothing breaks. If something did, you could respond to my answer above and provide a single example of a real-world example that broke as a Yes, but the proportional priority-based oom_score_adj values allow users to avoid recalculating and writing that value anytime a mempolicy attachment changes, its nodemask changes, it moves to another cpuset, its set of mems changes, its memcg attachment changes, its limit is modiifed, etc. --
Changelog
o since v1
- function comment also change current->cred_guard_mutex to
current->signal->cred_guard_mutex.
---------------------------------------------------------------------------
Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent
execve() has no worth.
Let's move ->cred_guard_mutex from task_struct to signal_struct. It
naturally prevent multiple-threads-inside-exec.
Cc: stable@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
fs/exec.c | 10 +++++-----
fs/proc/base.c | 8 ++++----
include/linux/init_task.h | 4 ++--
include/linux/sched.h | 7 ++++---
include/linux/tracehook.h | 2 +-
kernel/cred.c | 4 +---
kernel/fork.c | 2 ++
kernel/ptrace.c | 4 ++--
8 files changed, 21 insertions(+), 20 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 6d2b6f9..94dabd2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1078,14 +1078,14 @@ EXPORT_SYMBOL(setup_new_exec);
*/
int prepare_bprm_creds(struct linux_binprm *bprm)
{
- if (mutex_lock_interruptible(&current->cred_guard_mutex))
+ if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
return -ERESTARTNOINTR;
bprm->cred = prepare_exec_creds();
if (likely(bprm->cred))
return 0;
- mutex_unlock(&current->cred_guard_mutex);
+ mutex_unlock(&current->signal->cred_guard_mutex);
return -ENOMEM;
}
@@ -1093,7 +1093,7 @@ void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
- mutex_unlock(&current->cred_guard_mutex);
+ mutex_unlock(&current->signal->cred_guard_mutex);
abort_creds(bprm->cred);
}
kfree(bprm);
@@ -1114,13 +1114,13 @@ void install_exec_creds(struct linux_binprm *bprm)
* credentials; any time after this it may be ...This has my ACK if Oleg doesn't see any problems. Thanks, Roland --
I believe the patch is fine (it already has my reviewed-by). Except: I am not sure about -stable. At least, this patch should not go into the <2.6.35 kernels, it relies on misc changes which changed the scope of task->signal. Before 2.6.35 almost any user of ->cred_guard_mutex can race with exit and hit ->signal == NULL. Oleg. --
I see no justification for a change like this in any -stable tree. It's just a cleanup, right? If it's a prerequisite for the fix we like for an "important" bug, then that's a different story. In its own right, it's clearly not appropriate for backporting. Thanks, Roland --
Because [4/4] depend on [3/4] and I hope to backport it. Do you dislike it too? Thanks. --
Ah, OK. That is indeed a fix for an important bug. Not knowing the mm code very well, I'm not in a position to judge whether it's safe enough for a -stable stream or not. If it is and it could be done safely without relying on 3/4, that would seem safer to me, but it is not a strong opinion. Thanks, Roland --
ChangeLog
o since v2
- Move ->in_exec_mm from task_struct to signal_struct
- clean up oom_rss_swap_usage()
o since v1
- Always use thread group leader's ->in_exec_mm.
It slightly makes efficient oom when a process has many thread.
- Add the link of Brad's explanation to the description.
-----------------------------------------------------------
Brad Spengler published a local memory-allocation DoS that
evades the OOM-killer (though not the virtual memory RLIMIT):
http://www.grsecurity.net/~spender/64bit_dos.c
Because execve() makes new mm struct and setup stack and
copy argv. It mean the task have two mm while execve() temporary.
Unfortunately this nascent mm is not pointed any tasks, then
OOM-killer can't detect this memory usage. therefore OOM-killer
may kill incorrect task.
Thus, this patch added signal->in_exec_mm member and track
nascent mm usage.
Cc: stable@kernel.org
Cc: pageexec@freemail.hu
Cc: Roland McGrath <roland@redhat.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Eugene Teo <eteo@redhat.com>
Reported-by: Brad Spengler <spender@grsecurity.net>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
fs/compat.c | 4 +++-
fs/exec.c | 16 +++++++++++++++-
include/linux/binfmts.h | 1 +
include/linux/sched.h | 1 +
mm/oom_kill.c | 26 +++++++++++++++++++-------
5 files changed, 39 insertions(+), 9 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index 0644a15..a85b196 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1567,8 +1567,10 @@ int compat_do_execve(char * filename,
return retval;
out:
- if (bprm->mm)
+ if (bprm->mm) {
+ set_exec_mm(NULL);
mmput(bprm->mm);
+ }
out_file:
if (bprm->file) {
diff --git a/fs/exec.c b/fs/exec.c
index 94dabd2..2395d10 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -347,6 +347,8 @@ int bprm_mm_init(struct linux_binprm *bprm)
if (err)
goto err;
+ set_exec_mm(mm);
+
return 0;
err:
@@ -759,6 ...On 25 Oct 2010 at 12:29, KOSAKI Motohiro wrote: hi, what happens when two (or more) threads in the same process call execve? the above set_exec_mm calls will race (de_thread doesn't happen until much later in execve) and overwrite each other's ->in_exec_mm which will still lead to problems since there will be at most one temporary mm accounted for in the oom killer. [update: since i don't seem to have been cc'd on the other patch that serializes execve, the above point is moot ;)] worse, even if each temporary mm was tracked separately there'd still be a race where the oom killer can get triggered with the culprit thread long gone (and reset ->in_exec_mm) and never to be found, so the oom killer would find someone else as guilty. now all this leads me to suggest a simpler solution, at least for the first problem mentioned above (i don't know what to do with the second one yet as it seems to be a generic issue with the oom killer, probably it should verify the oom situation once again after it took the task_list lock). [update: while the serialized execve solves the first problem, i still think that my idea is simpler and worth considering, so i leave it here even if for just documentation purposes ;)] given that all the oom killer needs from the mm struct is either ->total_pages (in .35 and before, so be careful with the stable backport) or some ->rss_stat counters, wouldn't it be much easier to simply transfer the bprm->mm counters into current->mm for the duration of the execve (say, add them in get_arg_page and remove them when bprm->mm is mmput in the do_execve failure path, etc)? the transfer can be either to the existing counters or to new ones (obviously in the latter case the oom code needs a small change to take the new counters into account as well). cheers, PaX Team --
Hi patch 3/4 prevent this race :) now, 3/4 move cred_guard_mutex into signal struct. and execve() take Ah, sorry. that's my mistake. I thought you've reviewed this one at my last posting. can you please see 3/4? the URL is below. As I said at previous discussion, It is possible and one of option. and I've made the patch of this way too at once. But, It is messy than current. because pages in nascent mm are also swappable. then, a swapping-out of such page need to update both mm->rss_stat and nascent_mm->rss_stat. IOW, we need to change VM core. But, actually, execve vs OOM race is very rarely event, then, I don't hope to add some new branch and complexity. Note: before 2.6.35, oom_kill.c track amount of process virtual address space. then changing get_arg_page() is enough. but on 2.6.36 or later, oom_kill.c track amount of process rss. then we can't ignore swap in/out event. and changing get_arg_page() is not enough. Or, Do you propse new OOM account mm->rss + nascent_mm->total_vm? this can be easily. but tricky more. So, I think this is one of trade-off issue. If you have better patch rather than me, I'm glad to accept your one and join to review it. However myself don't plan to take this approach. Thanks. --
Stupid question.
Can't we just account these allocations in the old -mm temporary?
IOW. Please look at the "patch" below. It is of course incomplete
and wrong (to the point inc_mm_counter() is not safe without
SPLIT_RSS_COUNTING), and copy_strings/flush_old_exec are not the
best places to play with mm-counters, just to explain what I mean.
It is very simple. copy_strings() increments MM_ANONPAGES every
time we add a new page into bprm->vma. This makes this memory
visible to select_bad_process().
When exec changes ->mm (or if it fails), we change MM_ANONPAGES
counter back.
Most probably I missed something, but what do you think?
Oleg.
--- x/include/linux/binfmts.h
+++ x/include/linux/binfmts.h
@@ -29,6 +29,7 @@ struct linux_binprm{
char buf[BINPRM_BUF_SIZE];
#ifdef CONFIG_MMU
struct vm_area_struct *vma;
+ unsigned long mm_anonpages;
#else
# define MAX_ARG_PAGES 32
struct page *page[MAX_ARG_PAGES];
--- x/fs/exec.c
+++ x/fs/exec.c
@@ -457,6 +457,9 @@ static int copy_strings(int argc, const
goto out;
}
+ bmrp->mm_anonpages--;
+ inc_mm_counter(current->mm, MM_ANONPAGES);
+
if (kmapped_page) {
flush_kernel_dcache_page(kmapped_page);
kunmap(kmapped_page);
@@ -1003,6 +1006,7 @@ int flush_old_exec(struct linux_binprm *
/*
* Release all of the old mmap stuff
*/
+ add_mm_counter(current->mm, bprm->mm_anonpages);
retval = exec_mmap(bprm->mm);
if (retval)
goto out;
@@ -1426,8 +1430,10 @@ int do_execve(const char * filename,
return retval;
out:
- if (bprm->mm)
- mmput (bprm->mm);
+ if (bprm->mm) {
+ add_mm_counter(current->mm, bprm->mm_anonpages);
+ mmput(bprm->mm);
+ }
out_file:
if (bprm->file) {
--
Because, If the pages of argv is swapping out when processing execve, This accouing doesn't work. Of cource, changing swapping-out logic is one of way. But I did hope no VM core logic change. taking implict mlocking argv area during execve is also one of option. But I did think implicit mlocking is more risky. Is this enough explanation? Please don't hesitate say "no". If people don't like my approach, I don't hesitate change my thinking. Thanks. --
Why? If copy_strings() inserts the new page into bprm->vma and then this page is swapped out, inc_mm_counter(current->mm, MM_ANONPAGES) becomes incorrect, yes. And we can't turn it into MM_SWAPENTS. But does this really matter? oom_badness() counts MM_ANONPAGES + Well, certainly I can't say no ;) But it would be nice to find a more simple fix (if it can work, of course). And. I need a simple solution for the older kernels. Oleg. --
Ah, I got it. I did too strongly get stucked correct accounting. but you mean it's not must. Alright. It is certinally considerable one. --
Yes. In fact, I _think_ this patch makes accounting better, even if the extra MM_ANONPAGES numbers are not 100% correct. Even if we add signal->in_exec_mm, nobody except oom_badness() will look at it. With this patch, say, /proc/pid/statm or /proc/pid/status will report the memory allocated by the execing task. Even if technically this is not correct (and 'swap' part may be wrong), this makes sense imho. Otherwise, there is no way to see that this task allocates (may be a lot) of memory. Great! I'll send the patch tomorrow. Even if you prefer another fix for 2.6.37/stable, I'd like to see your review to know if it is correct or not (for backporting). Oleg. --
OK, what do you think about the patch below?
Seems to work, with this patch the test-case doesn't kill the
system (sysctl_oom_kill_allocating_task == 0).
I didn't dare to change !CONFIG_MMU case, I do not know how to
test it.
The patch is not complete, compat_copy_strings() needs changes.
But, shouldn't it use get_arg_page() too? Otherwise, where do
we check RLIMIT_STACK?
The patch asks for the cleanups. In particular, I think exec_mmap()
should accept bprm, not mm. But I'd prefer to do this later.
Oleg.
include/linux/binfmts.h | 1 +
fs/exec.c | 28 ++++++++++++++++++++++++++--
2 files changed, 27 insertions(+), 2 deletions(-)
--- K/include/linux/binfmts.h~acct_exec_mem 2010-08-19 11:35:00.000000000 +0200
+++ K/include/linux/binfmts.h 2010-11-25 20:19:33.000000000 +0100
@@ -33,6 +33,7 @@ struct linux_binprm{
# define MAX_ARG_PAGES 32
struct page *page[MAX_ARG_PAGES];
#endif
+ unsigned long vma_pages;
struct mm_struct *mm;
unsigned long p; /* current top of mem */
unsigned int
--- K/fs/exec.c~acct_exec_mem 2010-11-25 15:16:56.000000000 +0100
+++ K/fs/exec.c 2010-11-25 20:20:49.000000000 +0100
@@ -162,6 +162,25 @@ out:
return error;
}
+static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
+{
+ struct mm_struct *mm = current->mm;
+ long diff = pages - bprm->vma_pages;
+
+ if (!mm || !diff)
+ return;
+
+ bprm->vma_pages += diff;
+
+#ifdef SPLIT_RSS_COUNTING
+ add_mm_counter(mm, MM_ANONPAGES, diff);
+#else
+ spin_lock(&mm->page_table_lock);
+ add_mm_counter(mm, MM_ANONPAGES, diff);
+ spin_unlock(&mm->page_table_lock);
+#endif
+}
+
#ifdef CONFIG_MMU
static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
@@ -186,6 +205,8 @@ static struct page *get_arg_page(struct
unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start;
struct rlimit *rlim;
+ acct_arg_size(bprm, size / PAGE_SIZE);
+
/*
* We've historically supported up to 32 pages ...Because NOMMU doesn't have variable length argv. Instead it is still using MAX_ARG_STRLEN as old MMU code. 32 pages hard coded argv limitation naturally prevent this nascent mm Why do we need this unacct here? I mean 1) if exec_mmap() is success, we don't need unaccount at all 2) if exec_mmap() is failure, an epilogue of --
Ah, I didn't mean NOMMU. I meant compat_execve()->compat_copy_strings(). Well it does, to revert the MM_ANONPAGES counter. I'll add the empty Yes, we already killed all sub-threads. But this doesn't mean nobody else can use current->mm, think about CLONE_VM. The simplest example Yes. Thanks Kosaki! I'll resend v2 today. I am still not sure about compat_copy_strings()... Oleg. --
OK, please see below, just for your review.
Yes, I think it needs the same checks. It should use get_arg_page()
or we need more copy-and-paste code, I think it should also check
fatal_signal_pending() like copy_strings() does.
I was going to export get_arg_page/acct_arg_size, but it is so
ugly. I'll try to find the way to unify copy_strings and
compat_copy_strings, not sure it is possible to do cleanly.
Probably this needs a separate patch in any case.
Oleg.
Changes:
- move acct_arg_size() under CONFIG_MMU
- add the "nop" version for NOMMMU
include/linux/binfmts.h | 1 +
fs/exec.c | 32 ++++++++++++++++++++++++++++++--
2 files changed, 31 insertions(+), 2 deletions(-)
--- K/include/linux/binfmts.h~acct_exec_mem 2010-08-19 11:35:00.000000000 +0200
+++ K/include/linux/binfmts.h 2010-11-29 17:29:35.000000000 +0100
@@ -29,6 +29,7 @@ struct linux_binprm{
char buf[BINPRM_BUF_SIZE];
#ifdef CONFIG_MMU
struct vm_area_struct *vma;
+ unsigned long vma_pages;
#else
# define MAX_ARG_PAGES 32
struct page *page[MAX_ARG_PAGES];
--- K/fs/exec.c~acct_exec_mem 2010-11-25 15:16:56.000000000 +0100
+++ K/fs/exec.c 2010-11-29 17:51:43.000000000 +0100
@@ -164,6 +164,25 @@ out:
#ifdef CONFIG_MMU
+static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
+{
+ struct mm_struct *mm = current->mm;
+ long diff = (long)(pages - bprm->vma_pages);
+
+ if (!mm || !diff)
+ return;
+
+ bprm->vma_pages = pages;
+
+#ifdef SPLIT_RSS_COUNTING
+ add_mm_counter(mm, MM_ANONPAGES, diff);
+#else
+ spin_lock(&mm->page_table_lock);
+ add_mm_counter(mm, MM_ANONPAGES, diff);
+ spin_unlock(&mm->page_table_lock);
+#endif
+}
+
static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
int write)
{
@@ -186,6 +205,8 @@ static struct page *get_arg_page(struct
unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start;
struct rlimit *rlim;
+ acct_arg_size(bprm, size / PAGE_SIZE);
+
/*
* ...But I think this is the only option for 2.6.37/stable. So. I am sending 2 patches, hopefully they fix the problems I'll send the cleanups which unify compat/non-compat code on top of these fixes, this is not stable material. Oleg. --
Brad Spengler published a local memory-allocation DoS that evades the OOM-killer (though not the virtual memory RLIMIT): http://www.grsecurity.net/~spender/64bit_dos.c execve()->copy_strings() can allocate a lot of memory, but this is not visible to oom-killer, nobody can see the nascent bprm->mm and take it into account. With this patch get_arg_page() increments current's MM_ANONPAGES counter every time we allocate the new page for argv/envp. When do_execve() succeds or fails, we change this counter back. Technically this is not 100% correct, we can't know if the new page is swapped out and turn MM_ANONPAGES into MM_SWAPENTS, but I don't think this really matters and everything becomes correct once exec changes ->mm or fails. Reported-by: Brad Spengler <spender@grsecurity.net> By-discussion-with: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- include/linux/binfmts.h | 1 + fs/exec.c | 32 ++++++++++++++++++++++++++++++-- 2 files changed, 31 insertions(+), 2 deletions(-) --- K/include/linux/binfmts.h~acct_exec_mem 2010-11-30 18:27:15.000000000 +0100 +++ K/include/linux/binfmts.h 2010-11-30 18:28:54.000000000 +0100 @@ -29,6 +29,7 @@ struct linux_binprm{ char buf[BINPRM_BUF_SIZE]; #ifdef CONFIG_MMU struct vm_area_struct *vma; + unsigned long vma_pages; #else # define MAX_ARG_PAGES 32 struct page *page[MAX_ARG_PAGES]; --- K/fs/exec.c~acct_exec_mem 2010-11-30 18:27:15.000000000 +0100 +++ K/fs/exec.c 2010-11-30 18:28:54.000000000 +0100 @@ -164,6 +164,25 @@ out: #ifdef CONFIG_MMU +static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages) +{ + struct mm_struct *mm = current->mm; + long diff = (long)(pages - bprm->vma_pages); + + if (!mm || !diff) + return; + + bprm->vma_pages = pages; + +#ifdef SPLIT_RSS_COUNTING + add_mm_counter(mm, MM_ANONPAGES, diff); +#else + spin_lock(&mm->page_table_lock); + add_mm_counter(mm, MM_ANONPAGES, ...
Looks good to me. Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> One minor request. I guess this function can easily makes confusing to a code reader. So I hope you write small function comments. describe to - What is oom nascent issue --
Agreed, this needs a comment. The patch was already applied, I'll send a separate one on top of the next "unify exec/compat" series. Or, I'll add the comments into this series, depending on review. Thanks, Oleg. --
Note: this patch targets 2.6.37 and tries to be as simple as possible.
That is why it adds more copy-and-paste horror into fs/compat.c and
uglifies fs/exec.c, this will be cleanuped later.
compat_copy_strings() plays with bprm->vma/mm directly and thus has
two problems: it lacks the RLIMIT_STACK check and argv/envp memory
is not visible to oom killer.
Export acct_arg_size() and get_arg_page(), change compat_copy_strings()
to use get_arg_page(), change compat_do_execve() to do acct_arg_size(0)
as do_execve() does.
Add the fatal_signal_pending/cond_resched checks into compat_count() and
compat_copy_strings(), this matches the code in fs/exec.c and certainly
makes sense.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/binfmts.h | 4 ++++
fs/exec.c | 8 ++++----
fs/compat.c | 28 +++++++++++++++-------------
3 files changed, 23 insertions(+), 17 deletions(-)
--- K/include/linux/binfmts.h~compat_get_arg_page 2010-11-30 18:28:54.000000000 +0100
+++ K/include/linux/binfmts.h 2010-11-30 18:30:45.000000000 +0100
@@ -60,6 +60,10 @@ struct linux_binprm{
unsigned long loader, exec;
};
+extern void acct_arg_size(struct linux_binprm *bprm, unsigned long pages);
+extern struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
+ int write);
+
#define BINPRM_FLAGS_ENFORCE_NONDUMP_BIT 0
#define BINPRM_FLAGS_ENFORCE_NONDUMP (1 << BINPRM_FLAGS_ENFORCE_NONDUMP_BIT)
--- K/fs/exec.c~compat_get_arg_page 2010-11-30 18:28:54.000000000 +0100
+++ K/fs/exec.c 2010-11-30 18:30:45.000000000 +0100
@@ -164,7 +164,7 @@ out:
#ifdef CONFIG_MMU
-static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
+void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
{
struct mm_struct *mm = current->mm;
long diff = (long)(pages - bprm->vma_pages);
@@ -183,7 +183,7 @@ static void acct_arg_size(struct linux_b
#endif
}
-static struct page *get_arg_page(struct linux_binprm *bprm, ...Looks good to me. Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --
(remove stable) On top of [PATCH 1/2] exec: make argv/envp memory visible to oom-killer [PATCH 2/2] exec: copy-and-paste the fixes into compat_do_execve() paths Imho, execve code in fs/compat.c must die. It is very hard to maintain this copy-and-paste horror. Oleg. --
No functional changes, preparation to simplify the review.
And the new (and currently unused) "bool compat" argument to
get_arg_ptr(), count(), and copy_strings().
Add this argument to do_execve() as well, and rename it to
do_execve_common().
Reintroduce do_execve() as a trivial wrapper() on top of
do_execve_common(compat => false).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/exec.c | 33 +++++++++++++++++++++------------
1 file changed, 21 insertions(+), 12 deletions(-)
--- K/fs/exec.c~2_is_compat_arg 2010-11-30 19:14:54.000000000 +0100
+++ K/fs/exec.c 2010-11-30 19:47:24.000000000 +0100
@@ -391,7 +391,7 @@ err:
}
static const char __user *
-get_arg_ptr(const char __user * const __user *argv, int argc)
+get_arg_ptr(const char __user * const __user *argv, int argc, bool compat)
{
const char __user *ptr;
@@ -404,13 +404,13 @@ get_arg_ptr(const char __user * const __
/*
* count() counts the number of strings in array ARGV.
*/
-static int count(const char __user * const __user * argv, int max)
+static int count(const char __user * const __user *argv, int max, bool compat)
{
int i = 0;
if (argv != NULL) {
for (;;) {
- const char __user *p = get_arg_ptr(argv, i);
+ const char __user *p = get_arg_ptr(argv, i, compat);
if (!p)
break;
@@ -435,7 +435,7 @@ static int count(const char __user * con
* ensures the destination page is created and not swapped out.
*/
static int copy_strings(int argc, const char __user *const __user *argv,
- struct linux_binprm *bprm)
+ struct linux_binprm *bprm, bool compat)
{
struct page *kmapped_page = NULL;
char *kaddr = NULL;
@@ -448,7 +448,7 @@ static int copy_strings(int argc, const
unsigned long pos;
ret = -EFAULT;
- str = get_arg_ptr(argv, argc);
+ str = get_arg_ptr(argv, argc, compat);
if (IS_ERR(str))
goto out;
@@ -531,7 +531,8 @@ int copy_strings_kernel(int argc, const
int r;
mm_segment_t oldfs = get_fs();
...Introduce get_arg_ptr() helper, convert count() and copy_strings()
to use it.
No functional changes, preparation. This helper is trivial, it just
reads the pointer from argv/envp user-space array.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/exec.c | 36 +++++++++++++++++++++++++-----------
1 file changed, 25 insertions(+), 11 deletions(-)
--- K/fs/exec.c~1_get_arg_ptr 2010-11-30 18:30:45.000000000 +0100
+++ K/fs/exec.c 2010-11-30 19:14:54.000000000 +0100
@@ -390,6 +390,17 @@ err:
return err;
}
+static const char __user *
+get_arg_ptr(const char __user * const __user *argv, int argc)
+{
+ const char __user *ptr;
+
+ if (get_user(ptr, argv + argc))
+ return ERR_PTR(-EFAULT);
+
+ return ptr;
+}
+
/*
* count() counts the number of strings in array ARGV.
*/
@@ -399,13 +410,14 @@ static int count(const char __user * con
if (argv != NULL) {
for (;;) {
- const char __user * p;
+ const char __user *p = get_arg_ptr(argv, i);
- if (get_user(p, argv))
- return -EFAULT;
if (!p)
break;
- argv++;
+
+ if (IS_ERR(p))
+ return -EFAULT;
+
if (i++ >= max)
return -E2BIG;
@@ -435,16 +447,18 @@ static int copy_strings(int argc, const
int len;
unsigned long pos;
- if (get_user(str, argv+argc) ||
- !(len = strnlen_user(str, MAX_ARG_STRLEN))) {
- ret = -EFAULT;
+ ret = -EFAULT;
+ str = get_arg_ptr(argv, argc);
+ if (IS_ERR(str))
goto out;
- }
- if (!valid_arg_len(bprm, len)) {
- ret = -E2BIG;
+ len = strnlen_user(str, MAX_ARG_STRLEN);
+ if (!len)
+ goto out;
+
+ ret = -E2BIG;
+ if (!valid_arg_len(bprm, len))
goto out;
- }
/* We're going to work our way backwords. */
pos = bprm->p;
--
Teach get_arg_ptr() to handle compat = T case correctly.
This allows us to remove the compat_do_execve() code from fs/compat.c
and reimplement compat_do_execve() as the trivial wrapper on top of
do_execve_common(compat => true).
In fact, this fixes another (minor) bug. "compat_uptr_t str" can
overflow after "str += len" in compat_copy_strings() if a 64bit
application execs via sys32_execve().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/exec.c | 25 ++++++
fs/compat.c | 235 ------------------------------------------------------------
2 files changed, 25 insertions(+), 235 deletions(-)
--- K/fs/exec.c~3_use_compat 2010-11-30 19:47:24.000000000 +0100
+++ K/fs/exec.c 2010-11-30 20:15:11.000000000 +0100
@@ -55,6 +55,7 @@
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
#include <linux/oom.h>
+#include <linux/compat.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -395,6 +396,18 @@ get_arg_ptr(const char __user * const __
{
const char __user *ptr;
+#ifdef CONFIG_COMPAT
+ if (unlikely(compat)) {
+ compat_uptr_t __user *a = (void __user*)argv;
+ compat_uptr_t p;
+
+ if (get_user(p, a + argc))
+ return ERR_PTR(-EFAULT);
+
+ return compat_ptr(p);
+ }
+#endif
+
if (get_user(ptr, argv + argc))
return ERR_PTR(-EFAULT);
@@ -1501,6 +1514,18 @@ int do_execve(const char *filename,
return do_execve_common(filename, argv, envp, regs, false);
}
+#ifdef CONFIG_COMPAT
+int compat_do_execve(char * filename,
+ compat_uptr_t __user *argv,
+ compat_uptr_t __user *envp,
+ struct pt_regs * regs)
+{
+ return do_execve_common(filename,
+ (void __user*)argv, (void __user*)envp,
+ regs, true);
+}
+#endif
+
void set_binfmt(struct linux_binfmt *new)
{
struct mm_struct *mm = current->mm;
--- K/fs/compat.c~3_use_compat 2010-11-30 18:30:45.000000000 +0100
+++ K/fs/compat.c 2010-11-30 20:17:28.000000000 +0100
@@ -1330,241 +1330,6 @@ compat_sys_openat(unsigned int dfd, cons
return do_sys_open(dfd, ...This should not be marked unlikely. Unlikely tells gcc the path with over 99% confidence and disables branch predictors on some architectures. If called from a compat processes this will result in a mispredicted branch every iteration. Just use if (compat) Thanks, milton --
This applies to almost every likely/unlikely, and I think that compat processes should fall into "unlikely category". But I don't really mind, I'll recheck, but I don't think so. Please note that compat_ptr() accepts "compat_uptr_t", not "compat_uptr_t *". argv should be correct as a pointer to user-space, otherwise the current code is buggy. For example, compat_do_execve() passes argv to compat_count() which does get_user(argv) without any conversion. IOW, even if this should be fixed, I think this have nothing to do with this patch. But I'll recheck, thanks. Oleg. --
Unexport acct_arg_size() and get_arg_page(), fs/compat.c doesn't
need them any longer.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/binfmts.h | 4 ----
fs/exec.c | 8 ++++----
2 files changed, 4 insertions(+), 8 deletions(-)
--- K/include/linux/binfmts.h~4_unexport_arg_helpers 2010-11-30 18:30:45.000000000 +0100
+++ K/include/linux/binfmts.h 2010-11-30 20:38:13.000000000 +0100
@@ -60,10 +60,6 @@ struct linux_binprm{
unsigned long loader, exec;
};
-extern void acct_arg_size(struct linux_binprm *bprm, unsigned long pages);
-extern struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
- int write);
-
#define BINPRM_FLAGS_ENFORCE_NONDUMP_BIT 0
#define BINPRM_FLAGS_ENFORCE_NONDUMP (1 << BINPRM_FLAGS_ENFORCE_NONDUMP_BIT)
--- K/fs/exec.c~4_unexport_arg_helpers 2010-11-30 20:15:11.000000000 +0100
+++ K/fs/exec.c 2010-11-30 20:38:13.000000000 +0100
@@ -165,7 +165,7 @@ out:
#ifdef CONFIG_MMU
-void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
+static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
{
struct mm_struct *mm = current->mm;
long diff = (long)(pages - bprm->vma_pages);
@@ -184,7 +184,7 @@ void acct_arg_size(struct linux_binprm *
#endif
}
-struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
+static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
int write)
{
struct page *page;
@@ -298,11 +298,11 @@ static bool valid_arg_len(struct linux_b
#else
-void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
+static inline void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
{
}
-struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
+static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
int write)
{
struct page *page;
--
I strongly like this series. (yes, I made fault to forgot to change compat.c multiple times ;) Unfortunatelly, this is a bit large and I have no time now. I expect I can review this at this or next weekend..... Hopefully, anyoneelse will review this and ignore me.... --
NACK. Same response as the previous three times this patch has been proposed: http://marc.info/?t=128461666500001 http://marc.info/?t=128324705200002 http://marc.info/?t=128272938200002 --
