In multi-threading environment, if the current task(A) have got
the mm->mmap_sem semaphore, and the thread(B) in the same process
is selected to be oom killed, because they shares the same semaphore,
thread B can not really be killed. So __alloc_pages_slowpath turns
to be a infinite loop. Here set all the threads in the group to
TIF_MEMDIE, it gets a chance to break and exit.
Signed-off-by: Anfei Zhou <anfei.zhou@gmail.com>
---
mm/oom_kill.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9b223af..aab9892 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -381,6 +381,8 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
*/
static void __oom_kill_task(struct task_struct *p, int verbose)
{
+ struct task_struct *t;
+
if (is_global_init(p)) {
WARN_ON(1);
printk(KERN_WARNING "tried to kill init!\n");
@@ -412,6 +414,8 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
*/
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
+ for (t = next_thread(p); t != p; t = next_thread(t))
+ set_tsk_thread_flag(t, TIF_MEMDIE);
force_sig(SIGKILL, p);
}
--
1.6.4.rc1
--
I like this patch very much. Thanks, Anfei! --
On Thu, 25 Mar 2010 00:25:05 +0800 Don't we need some sort of locking while walking that ring? Unintuitively it appears to be spin_lock_irq(&p->sighand->siglock). --
This should be always called under tasklist_lock, I think.
At least this seems to be true in Linus's tree.
I'd suggest to do
- set_tsk_thread_flag(p, TIF_MEMDIE);
+ t = p;
+ do {
+ set_tsk_thread_flag(t, TIF_MEMDIE);
+ } while_each_thread(p, t);
but this is matter of taste.
Off-topic, but we shouldn't use force_sig(), SIGKILL doesn't
need "force" semantics.
I'd wish I could understand the changelog ;)
Oleg.
--
Yes, this function is always called with read_lock(&tasklist_lock), so
This may need a dedicated patch, there are some other places to
Assume thread A and B are in the same group. If A runs into the oom,
and selects B as the victim, B won't exit because at least in exit_mm(),
it can not get the mm->mmap_sem semaphore which A has already got. So
no memory is freed, and no other task will be selected to kill.
I formatted the patch for -mm tree as David suggested.
---
mm/oom_kill.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -418,8 +418,15 @@ static void dump_header(struct task_stru
*/
static void __oom_kill_task(struct task_struct *p)
{
+ struct task_struct *t;
+
p->rt.time_slice = HZ;
- set_tsk_thread_flag(p, TIF_MEMDIE);
+
+ t = p;
+ do {
+ set_tsk_thread_flag(t, TIF_MEMDIE);
+ } while_each_thread(p, t);
+
force_sig(SIGKILL, p);
}
--
I see. But still I can't understand. To me, the problem is not that B can't exit, the problem is that A doesn't know it should exit. All threads should exit and free ->mm. Even if B could exit, this is not enough. And, to some extent, it doesn't matter if it holds mmap_sem or not. Don't get me wrong. Even if I don't understand oom_kill.c the patch looks obviously good to me, even from "common sense" pov. I am just curious. So, my understanding is: we are going to kill the whole thread group but TIF_MEMDIE is per-thread. Mark the whole thread group as TIF_MEMDIE so that any thread can notice this flag and (say, __alloc_pages_slowpath) fail asap. Is my understanding correct? Oleg. --
[Adding Mel Gorman <mel@csn.ul.ie> to the cc]
The problem with this approach is that we could easily deplete all memory
reserves if the oom killed task has an extremely large number of threads,
there has always been only a single thread with TIF_MEMDIE set per cpuset
or memcg; for systems that don't run with cpusets or memory controller,
this has been limited to one thread with TIF_MEMDIE for the entire system.
There's risk involved with suddenly allowing 1000 threads to have
TIF_MEMDIE set and the chances of fully depleting all allowed zones is
much higher if they allocate memory prior to exit, for example.
An alternative is to fail allocations if they are failable and the
allocating task has a pending SIGKILL. It's better to preempt the oom
killer since current is going to be exiting anyway and this avoids a
needless kill.
That's possible if it's guaranteed that __GFP_NOFAIL allocations with a
pending SIGKILL are granted ALLOC_NO_WATERMARKS to prevent them from
endlessly looping while making no progress.
Comments?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1610,13 +1610,21 @@ try_next_zone:
}
static inline int
-should_alloc_retry(gfp_t gfp_mask, unsigned int order,
+should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
unsigned long pages_reclaimed)
{
/* Do not loop if specifically requested */
if (gfp_mask & __GFP_NORETRY)
return 0;
+ /* Loop if specifically requested */
+ if (gfp_mask & __GFP_NOFAIL)
+ return 1;
+
+ /* Task is killed, fail the allocation if possible */
+ if (fatal_signal_pending(p))
+ return 0;
+
/*
* In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
* means __GFP_NOFAIL, but that may not be true in other
@@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
return 1;
- /*
- * Don't let big-order allocations ...Can't comment, I do not understand these subtleties. But I'd like to note that fatal_signal_pending() can be true when the process wasn't killed, but another thread does exit_group/exec. I am not saying this is wrong, just I'd like to be sure this didn't --
I'm not sure there's a difference between whether a process was oom killed and received a SIGKILL that way or whether exit_group(2) was used, so I don't think we need to test for (p->signal->flags & SIGNAL_GROUP_EXIT) here. We do need to guarantee that exiting tasks always can get memory, which is the responsibility of setting TIF_MEMDIE. The only thing this patch does is defer calling the oom killer when a task has a pending SIGKILL and then fail the allocation when it would otherwise repeat. Instead of the considerable risk involved with no failing GFP_KERNEL allocations that are under PAGE_ALLOC_COSTLY_ORDER that is typically never done, it may make more sense to retry the allocation with TIF_MEMDIE on the second iteration: in essence, automatically selecting current for oom kill regardless of other oom killed tasks if it already has a pending SIGKILL. oom: give current access to memory reserves if it has been killed It's possible to livelock the page allocator if a thread has mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit until the semaphore is dropped. The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed. Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well. In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore. Thus, the page allocators livelocks. When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically selects it for kill so that it has access to memory reserves and the better timeslice. Upon returning to the page allocator, its allocation ...
I am worried...
Note that __oom_kill_task() does force_sig(SIGKILL) which assumes that
->sighand != NULL. This is not true if out_of_memory() is called after
current has already passed exit_notify().
Hmm. looking at oom_kill.c... Afaics there are more problems with mt
apllications. select_bad_process() does for_each_process() which can
only see the group leaders. This is fine, but what if ->group_leader
has already exited? In this case its ->mm == NULL, and we ignore the
whole thread group.
IOW, unless I missed something, it is very easy to hide the process
from oom-kill:
int main()
{
pthread_create(memory_hog_func);
syscall(__NR_exit);
}
probably we need something like
--- x/mm/oom_kill.c
+++ x/mm/oom_kill.c
@@ -246,21 +246,27 @@ static enum oom_constraint constrained_a
static struct task_struct *select_bad_process(unsigned long *ppoints,
struct mem_cgroup *mem)
{
- struct task_struct *p;
+ struct task_struct *g, *p;
struct task_struct *chosen = NULL;
struct timespec uptime;
*ppoints = 0;
do_posix_clock_monotonic_gettime(&uptime);
- for_each_process(p) {
+ for_each_process(g) {
unsigned long points;
/*
* skip kernel threads and tasks which have already released
* their mm.
*/
+ p = g;
+ do {
+ if (p->mm)
+ break;
+ } while_each_thread(g, p);
if (!p->mm)
continue;
+
/* skip the init task */
if (is_global_init(p))
continue;
except is should be simplified and is_global_init() should check g.
No?
Oh... proc_oom_score() is racy. We can't trust ->group_leader even
under tasklist_lock. If we race with exit/exec it can point to
nowhere. I'll send the simple fix.
Oleg.
--
We have an even bigger problem if current is in the oom killer at The check for !p->mm was moved in the -mm tree (and the oom killer was entirely rewritten in that tree, so I encourage you to work off of it instead) with oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch to even after the check for PF_EXITING. This is set in the exit path before the ->mm is detached so if the oom killer finds an already exiting task, it will become a no-op since it should eventually free memory and avoids a needless oom kill. --
Can't understand... I thought that in theory even kmalloc(1) can trigger oom. Say, right after exit_mm() we are doing acct_process(), and f_op->write() needs a page. So, you are saying that in this case __page_cache_alloc() OK, but I guess this !p->mm check is still wrong for the same reason. In fact I do not understand why it is needed in select_bad_process() right before oom_badness() which checks ->mm too (and this check is No, afaics, And this reminds that I already complained about this PF_EXITING check. Once again, p is the group leader. It can be dead (no ->mm, PF_EXITING is set) but it can have sub-threads. This means, unless I missed something, any user can trivially disable select_bad_process() forever. Well. Looks like, -mm has a lot of changes in oom_kill.c. Perhaps it would be better to fix these mt bugs first... Say, oom_forkbomb_penalty() does list_for_each_entry(tsk->children). Again, this is not right even if we forget about !child->mm check. This list_for_each_entry() can only see the processes forked by the main thread. Likewise, oom_kill_process()->list_for_each_entry() is not right too. Hmm. Why oom_forkbomb_penalty() does thread_group_cputime() under task_lock() ? It seems, ->alloc_lock() is only needed for get_mm_rss(). Oleg. --
Probably something like the patch below makes sense. Note that
"skip kernel threads" logic is wrong too, we should check PF_KTHREAD.
Probably it is better to check it in select_bad_process() instead,
near is_global_init().
The new helper, find_lock_task_mm(), should be used by
oom_forkbomb_penalty() too.
dump_tasks() doesn't need it, it does do_each_thread(). Cough,
__out_of_memory() and out_of_memory() call it without tasklist.
We are going to panic() anyway, but still.
Oleg.
--- x/mm/oom_kill.c
+++ x/mm/oom_kill.c
@@ -129,6 +129,19 @@ static unsigned long oom_forkbomb_penalt
(child_rss / sysctl_oom_forkbomb_thres) : 0;
}
+static find_lock_task_mm(struct task_struct *p)
+{
+ struct task_struct *t = p;
+ do {
+ task_lock(t);
+ if (likely(t->mm && !(t->flags & PF_KTHREAD)))
+ return t;
+ task_unlock(t);
+ } while_each_thred(p, t);
+
+ return NULL;
+}
+
/**
* oom_badness - heuristic function to determine which candidate task to kill
* @p: task struct of which task we should calculate
@@ -159,13 +172,9 @@ unsigned int oom_badness(struct task_str
if (p->flags & PF_OOM_ORIGIN)
return 1000;
- task_lock(p);
- mm = p->mm;
- if (!mm) {
- task_unlock(p);
+ p = find_lock_task_mm(p);
+ if (!p)
return 0;
- }
-
/*
* The baseline for the badness score is the proportion of RAM that each
* task's rss and swap space use.
@@ -330,12 +339,6 @@ static struct task_struct *select_bad_pr
*ppoints = 1000;
}
- /*
- * skip kernel threads and tasks which have already released
- * their mm.
- */
- if (!p->mm)
- continue;
if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
continue;
--
You can't do this for the reason I cited in another email, oom_badness() returning 0 does not exclude a task from being chosen by selcet_bad_process(), it will use that task if nothing else has been found yet. We must explicitly filter it from consideration by checking for !p->mm. --
Yes, you are right. OK, oom_badness() can never return points < 0, we can make it int and oom_badness() can return -1 if !mm. IOW, - unsigned int points; + int points; ... points = oom_badness(...); if (points >= 0 && (points > *ppoints || !chosen)) chosen = p; Oleg. --
oom_badness() and its predecessor badness() in mainline never return negative scores, so I don't see the value in doing this; just filter the task in select_bad_process() with !p->mm as it has always been done. --
David, you continue to ignore my arguments ;) select_bad_process()
must not filter out the tasks with ->mm == NULL.
Once again:
void *memory_hog_thread(void *arg)
{
for (;;)
malloc(A_LOT);
}
int main(void)
{
pthread_create(memory_hog_thread, ...);
syscall(__NR_exit, 0);
}
Now, even if we fix PF_EXITING check, select_bad_process() will always
ignore this process. The group leader has ->mm == NULL.
See?
That is why I think we need something like find_lock_task_mm() in the
pseudo-patch I sent.
Or I missed something?
Oleg.
--
So. Please see the COMPLETELY UNTESTED patches I am sending. They need your review, or feel free to redo these fixes. 4/4 is a bit off-topic. Also, please note the "This patch is not enough" comment in 3/4. Oleg. --
select_bad_process() thinks a kernel thread can't have ->mm != NULL,
this is not true due to use_mm().
Change the code to check PF_KTHREAD.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
--- MM/mm/oom_kill.c~1_FLITER_OUT_KTHREADS 2010-03-31 17:47:14.000000000 +0200
+++ MM/mm/oom_kill.c 2010-04-02 18:51:05.000000000 +0200
@@ -290,8 +290,8 @@ static struct task_struct *select_bad_pr
for_each_process(p) {
unsigned int points;
- /* skip the init task */
- if (is_global_init(p))
+ /* skip the init task and kthreads */
+ if (is_global_init(p) || (p->flags & PF_KTHREAD))
continue;
if (mem && !task_in_mem_cgroup(p, mem))
continue;
@@ -331,8 +331,7 @@ static struct task_struct *select_bad_pr
}
/*
- * skip kernel threads and tasks which have already released
- * their mm.
+ * skip the tasks which have already released their mm.
*/
if (!p->mm)
continue;
--
Acked-by: David Rientjes <rientjes@google.com> --
select_bad_process() checks PF_EXITING to detect the task which
is going to release its memory, but the logic is very wrong.
- a single process P with the dead group leader disables
select_bad_process() completely, it will always return
ERR_PTR() while P can live forever
- if the PF_EXITING task has already released its ->mm
it doesn't make sense to expect it is goiing to free
more memory (except task_struct/etc)
Change the code to ignore the PF_EXITING tasks without ->mm.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- MM/mm/oom_kill.c~2_FIX_PF_EXITING 2010-04-02 18:51:05.000000000 +0200
+++ MM/mm/oom_kill.c 2010-04-02 18:58:37.000000000 +0200
@@ -322,7 +322,7 @@ static struct task_struct *select_bad_pr
* the process of exiting and releasing its resources.
* Otherwise we could get an easy OOM deadlock.
*/
- if (p->flags & PF_EXITING) {
+ if ((p->flags & PF_EXITING) && p->mm) {
if (p != current)
return ERR_PTR(-1UL);
--
Even this check is satisfied, it still can't say p is a good victim or it will release memory automatically if multi threaded, as the exiting of p doesn't mean the other threads are going to exit, so the ->mm won't --
Yes, completely agreed. Unfortunately I forgot to copy this into the changelog, but when I discussed this change I mentioned "still not perfect, but much better". I do not really know what is the "right" solution. Even if we fix this check for mt case, we also have CLONE_VM tasks. So, this patch just tries to improve things, to avoid the easy-to-trigger false positives. Oleg. --
What about checking mm->mm_users too? If there are any other users, just let badness judge. CLONE_VM tasks but not mt seem rare, and Agreed. Thanks, --
Even if we forget about get_task_mm() which increments mm_users, it is not
clear to me how to do this check correctly.
Say, mm_users > 1 but SIGNAL_GROUP_EXIT is set. This means this process is
exiting and (ignoring CLONE_VM task) it is going to release its ->mm. But
otoh mm can be NULL.
Perhaps we can do
if ((PF_EXITING && thread_group_empty(p) ||
(p->signal->flags & SIGNAL_GROUP_EXIT) {
// OK, it is exiting
bool has_mm = false;
do {
if (t->mm) {
has_mm = true;
break;
}
} while_each_thread(p, t);
if (!has_mm)
continue;
if (p != current)
return ERR_PTR(-1);
...
}
I dunno.
Oleg.
--
Almost all ->mm == NUL checks in oom_kill.c are wrong.
The current code assumes that the task without ->mm has already
released its memory and ignores the process. However this is not
necessarily true when this process is multithreaded, other live
sub-threads can use this ->mm.
- Remove the "if (!p->mm)" check in select_bad_process(), it is
just wrong.
- Add the new helper, find_lock_task_mm(), which finds the live
thread which uses the memory and takes task_lock() to pin ->mm
- change oom_badness() to use this helper instead of just checking
->mm != NULL.
- As David pointed out, select_bad_process() must never choose the
task without ->mm, but no matter what oom_badness() returns the
task can be chosen if nothing else has been found yet.
Change oom_badness() to return int, change it to return -1 if
find_lock_task_mm() fails, and change select_bad_process() to
check points >= 0.
Note! This patch is not enough, we need more changes.
- oom_badness() was fixed, but oom_kill_task() still ignores
the task without ->mm
- oom_forkbomb_penalty() should use find_lock_task_mm() too,
and it also needs other changes to actually find the first
first-descendant children
This will be addressed later.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/oom.h | 2 +-
mm/oom_kill.c | 39 +++++++++++++++++++++------------------
2 files changed, 22 insertions(+), 19 deletions(-)
--- MM/include/linux/oom.h~3_FIX_MM_CHECKS 2010-03-31 17:47:14.000000000 +0200
+++ MM/include/linux/oom.h 2010-04-02 19:14:05.000000000 +0200
@@ -40,7 +40,7 @@ enum oom_constraint {
CONSTRAINT_MEMORY_POLICY,
};
-extern unsigned int oom_badness(struct task_struct *p,
+extern int oom_badness(struct task_struct *p,
unsigned long totalpages, unsigned long uptime);
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
--- ...It doesn't make sense to call thread_group_cputime() under task_lock(),
we can drop this lock right after we read get_mm_rss() and save the
value in the local variable.
Note: probably it makes more sense to use sum_exec_runtime instead
of utime + stime, it is much more precise. A task can eat a lot of
CPU time, but its Xtime can be zero.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
--- MM/mm/oom_kill.c~4_FORKBOMB_DROP_TASK_LOCK_EARLIER 2010-04-02 19:55:46.000000000 +0200
+++ MM/mm/oom_kill.c 2010-04-02 20:16:13.000000000 +0200
@@ -110,13 +110,16 @@ static unsigned long oom_forkbomb_penalt
return 0;
list_for_each_entry(child, &tsk->children, sibling) {
struct task_cputime task_time;
- unsigned long runtime;
+ unsigned long runtime, rss;
task_lock(child);
if (!child->mm || child->mm == tsk->mm) {
task_unlock(child);
continue;
}
+ rss = get_mm_rss(child->mm);
+ task_unlock(child);
+
thread_group_cputime(child, &task_time);
runtime = cputime_to_jiffies(task_time.utime) +
cputime_to_jiffies(task_time.stime);
@@ -126,10 +129,9 @@ static unsigned long oom_forkbomb_penalt
* get to execute at all in such cases anyway.
*/
if (runtime < HZ) {
- child_rss += get_mm_rss(child->mm);
+ child_rss += rss;
forkcount++;
}
- task_unlock(child);
}
/*
--
Acked-by: David Rientjes <rientjes@google.com> --
This is the David's patch rediffed agains the recent changes in -mm.
As David pointed out, we should fix select_bad_process() which currently
always selects the first process which was not filtered out before
oom_badness(), no matter what oom_badness() returns.
Change the code to ignore the process if oom_badness() returns 0, this
matters Documentation/filesystems/proc.txt and this merely looks better.
This also allows us to do more cleanups:
- no need to check OOM_SCORE_ADJ_MIN in select_bad_process(),
oom_badness() returns 0 in this case.
- oom_badness() can simply return 0 instead of -1 if the task
has no ->mm.
Now we can make it "unsigned" again, the signness and the
special "points >= 0" in select_bad_process() was added to
preserve the current behaviour.
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/oom.h | 3 ++-
mm/oom_kill.c | 11 ++++-------
2 files changed, 6 insertions(+), 8 deletions(-)
--- MM/include/linux/oom.h~5_BADNESS_DONT_RET_NEGATIVE 2010-04-05 15:39:21.000000000 +0200
+++ MM/include/linux/oom.h 2010-04-05 15:44:49.000000000 +0200
@@ -40,7 +40,8 @@ enum oom_constraint {
CONSTRAINT_MEMORY_POLICY,
};
-extern int oom_badness(struct task_struct *p, unsigned long totalpages);
+extern unsigned int oom_badness(struct task_struct *p,
+ unsigned long totalpages);
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
--- MM/mm/oom_kill.c~5_BADNESS_DONT_RET_NEGATIVE 2010-04-05 15:39:21.000000000 +0200
+++ MM/mm/oom_kill.c 2010-04-05 16:09:58.000000000 +0200
@@ -153,7 +153,7 @@ static unsigned long oom_forkbomb_penalt
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom conditions.
*/
-int oom_badness(struct task_struct *p, unsigned long ...I'm not ignoring your arguments, I think you're ignoring what I'm responding to. I prefer to keep oom_badness() to be a positive range as it always has been (and /proc/pid/oom_score has always used an unsigned qualifier), so I disagree that we need to change oom_badness() to return anything other than 0 for such tasks. We need to filter them explicitly in select_bad_process() instead, so please do this there. --
Yes, I thought about /proc/pid/oom_score, but imho this is minor issue. We can s/%lu/%ld/ though, or just report 0 if oom_badness() returns -1. The problem is, we need task_lock() to pin ->mm. Or, we can change find_lock_task_mm() to do get_task_mm() and return mm_struct *. But then oom_badness() (and proc_oom_score!) needs much more changes, it needs the new "struct mm_struct *mm" argument which is not necessarily equal to p->mm. So, I can't agree. Oleg. --
Just have it return 0, meaning never kill, and then ensure "chosen" is never set for an oom_badness() of 0, even if we don't have another task to kill. That's how Documentation/filesystems/proc.txt describes it anyway. --
An oom_badness() score of 0 means "never kill" according to
Documentation/filesystems/proc.txt, so explicitly exclude it from being
selected for kill. These tasks have either detached their p->mm or are
set to OOM_DISABLE.
Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/oom_kill.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -336,6 +336,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
continue;
points = oom_badness(p, totalpages);
+ if (!points)
+ continue;
if (points > *ppoints || !chosen) {
chosen = p;
*ppoints = points;
--
then "|| !chosen" can be killed.
with this patch !chosen <=> !*ppoints, and since points > 0
if (points > *ppoints) {
is enough.
Oleg.
--
An oom_badness() score of 0 means "never kill" according to
Documentation/filesystems/proc.txt, so exclude it from being selected for
kill. These tasks have either detached their p->mm or are set to
OOM_DISABLE.
Also removes an unnecessary initialization of points to 0 in
mem_cgroup_out_of_memory(), select_bad_process() does this already.
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/oom_kill.c | 13 ++-----------
1 files changed, 2 insertions(+), 11 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -326,17 +326,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
*ppoints = 1000;
}
- /*
- * skip kernel threads and tasks which have already released
- * their mm.
- */
- if (!p->mm)
- continue;
- if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- continue;
-
points = oom_badness(p, totalpages);
- if (points > *ppoints || !chosen) {
+ if (points > *ppoints) {
chosen = p;
*ppoints = points;
}
@@ -478,7 +469,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
{
unsigned long limit;
- unsigned int points = 0;
+ unsigned int points;
struct task_struct *p;
if (sysctl_panic_on_oom == 2)
--
OK, agreed, this makes more sense and more clean. I misunderstood you even more before. Thanks, I'll redo/resend 3/4. Oleg. --
It prevents kthreads from being killed. We already identify tasks that are in the exit path with PF_EXITING in select_bad_process() and chosen to make the oom killer a no-op when it's not current so it can exit and free its memory. If it is current, then we're ooming in the exit path and we need to oom kill it so that it gets access to memory reserves so its no The task is in the process of exiting and will do so if its not current, otherwise it will get access to memory reserves since we're obviously oom in the exit path. Thus, we'll be freeing that memory soon or recalling the oom killer to kill additional tasks once those children have been Right, but we need to ensure that the check for !child->mm || child->mm == tsk->mm fails before adding in get_mm_rss(child->mm). It can race and detach its mm prior to the dereference. It would be possible to move the thread_group_cputime() out of this critical section, but I felt it was better to do filter all tasks with child->mm == tsk->mm first before unnecessarily finding the cputime for them. --
Why? You ignored this part:
Say, right after exit_mm() we are doing acct_process(), and f_op->write()
needs a page. So, you are saying that in this case __page_cache_alloc()
can never trigger out_of_memory() ?
why this is not possible?
Just can't understand.
OK, a bad user does
int sleep_forever(void *)
{
pause();
}
int main(void)
{
pthread_create(sleep_forever);
syscall(__NR_exit);
}
Now, every time select_bad_process() is called it will find this process
and PF_EXITING is true, so it just returns ERR_PTR(-1UL). And note that
Why? shouldn't oom_badness() return the same result for any thread
Yes, but we can check child->mm == tsk->mm, call get_mm_counter() and drop
task_lock().
Oleg.
--
In case I wasn't clear... Yes, currently __oom_kill_task(p) is not possible if p->mm == NULL. But your patch adds if (fatal_signal_pending(current)) __oom_kill_task(current); into out_of_memory(). Oleg. --
Ok, and it's possible during the tasklist scan if current is PF_EXITING
and that gets passed to oom_kill_process(), so we need the following
patch. Can I have your acked-by and then I'll propose it to Andrew with a
follow-up that merges __oom_kill_task() into oom_kill_task() since it only
has one caller now anyway?
[ Both of these situations will be current since the oom killer is a
no-op whenever another task is found to be PF_EXITING and
oom_kill_process() wouldn't get called with any other thread unless
oom_kill_quick is enabled or its VM_FAULT_OOM in which cases we kill
current as well. ]
Thanks Oleg.
---
mm/oom_kill.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -459,7 +459,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* its children or threads, just set TIF_MEMDIE so it can die quickly
*/
if (p->flags & PF_EXITING) {
- __oom_kill_task(p);
+ set_tsk_thread_flag(p, TIF_MEMDIE);
return 0;
}
@@ -686,7 +686,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* its memory.
*/
if (fatal_signal_pending(current)) {
- __oom_kill_task(current);
+ set_tsk_thread_flag(current, TIF_MEMDIE);
return;
}
--
Yes, but this is harmless, afaics. The task is either current or it was found by select_bad_process() under tasklist. This means it is safe to Yes, I think this fix is needed. Oleg. --
It matches the already-existing comment that only says we need to set TIF_MEMDIE so it can quickly exit rather than call __oom_kill_task(), so Ok, I'll add your acked-by and send this to Andrew with a follow-up that consolidates __oom_kill_task() into oom_kill_task(), thanks. --
It can, but the check for p->mm is sufficient since exit_notify() takes
write_lock_irq(&tasklist_lock) that the oom killer holds for read, so the
rule is that whenever we have a valid p->mm, we have a valid p->sighand
and can do force_sig() while under tasklist_lock. The only time we call
oom_kill_process() without holding a readlock on tasklist_lock is for
current during pagefault ooms and we know it's not exiting because it's in
We cannot rely on oom_badness() to filter this task because we still
select it as our chosen task even with a badness score of 0 if !chosen, so
we must filter these threads ahead of time:
if (points > *ppoints || !chosen) {
chosen = p;
*ppoints = points;
}
Filtering on !p->mm prevents us from doing "if (points > *ppoints ||
(!chosen && p->mm))" because it's just cleaner and makes this rule
explicit.
Your point about p->mm being non-NULL for kthreads using use_mm() is
taken, we should probably just change the is_global_init() check in
select_bad_process() to p->flags & PF_KTHREAD and ensure we reject
Hmm, so it looks like we need to filter on !p->mm before checking for
PF_EXITING so that tasks that are EXIT_ZOMBIE won't make the oom killer
oom_forkbomb_penalty() only cares about first-descendant children that
do not share the same memory, so we purposely penalize the parent so that
it is more biased to select for oom kill and then it will sacrifice these
You could, but then you'd be calling thread_group_cputime() for all
We need task_lock() to ensure child->mm hasn't detached between the check
for child->mm == tsk->mm and get_mm_rss(child->mm). So I'm not sure what
you're trying to improve with this variation, it's a tradeoff between
calling thread_group_cputime() under task_lock() for a subset of a task's
threads when we already need to hold task_lock() anyway vs. calling it for
all threads unconditionally.
--
Yes, but I meant out_of_memory()->__oom_kill_task(current). OK, we
Yes, but we have to check both is_global_init() and PF_KTHREAD.
The "patch" I sent checks PF_KTHREAD in find_lock_task_mm(), but as I
As it was already discussed, it is not easy to check !p->mm. Once
again, we must not filter out the task just because its ->mm == NULL.
Probably the best change for now is
- if (p->flags & PF_EXITING) {
+ if (p->flags & PF_EXITING && p->mm) {
I see, but the code doesn't really do this. I mean, it doesn't really
see the first-descendant children, only those which were forked by the
main thread.
Look. We have a main thread M and the sub-thread T. T forks a lot of
processes which use a lot of memory. These processes _are_ the first
descendant children of the M+T thread group, they should be accounted.
But M->children list is empty.
oom_forkbomb_penalty() and oom_kill_process() should do
t = tsk;
do {
list_for_each_entry(child, &t->children, sibling) {
... take child into account ...
}
See the patch below. Yes, this is minor, but it is always good to avoid
the unnecessary locks, and thread_group_cputime() is O(N).
Not only for performance reasons. This allows to change the locking in
thread_group_cputime() if needed without fear to deadlock with task_lock().
Oleg.
--- x/mm/oom_kill.c
+++ x/mm/oom_kill.c
@@ -97,13 +97,16 @@ static unsigned long oom_forkbomb_penalt
return 0;
list_for_each_entry(child, &tsk->children, sibling) {
struct task_cputime task_time;
- unsigned long runtime;
+ unsigned long runtime, this_rss;
task_lock(child);
if (!child->mm || child->mm == tsk->mm) {
task_unlock(child);
continue;
}
+ this_rss = get_mm_rss(child->mm);
+ task_unlock(child);
+
thread_group_cputime(child, &task_time);
runtime = cputime_to_jiffies(task_time.utime) +
cputime_to_jiffies(task_time.stime);
@@ -113,10 +116,9 @@ static unsigned long oom_forkbomb_penalt
* get to execute at all in such ...In this case, it seems more appropriate that we would penalize T and not M since it's not necessarily responsible for the behavior of the children it This patch looks good, will you send it to Andrew with a changelog and sign-off line? Also feel free to add: Acked-by: David Rientjes <rientjes@google.com> --
We can't. Any fatal signal sent to any sub-thread kills the whole thread Since a) they share the same ->mm and b) they share their children, I don't think we should separate T and M. ->children is per_thread. But this is only because we have some strange historiral oddities like __WNOTHREAD. Otherwise, it is not correct to assume that the child of T is not the child of M. Any process is the child of its parent's thread group, not the thread which actually called Thanks! already in -mm. Oleg. --
proc_oom_score(task) have a reference to task_struct, but that is all.
If this task was already released before we take tasklist_lock
- we can't use task->group_leader, it points to nowhere
- it is not safe to call badness() even if this task is
->group_leader, has_intersects_mems_allowed() assumes
it is safe to iterate over ->thread_group list.
Add the pid_alive() check to ensure __unhash_process() was not called.
Note: I think we shouldn't use ->group_leader, badness() should return
the same result for any sub-thread. However this is not true currently,
and I think that ->mm check and list_for_each_entry(p->children) in
badness are not right.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/proc/base.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- 34-rc1/fs/proc/base.c~OOM_SCORE 2010-03-22 16:36:28.000000000 +0100
+++ 34-rc1/fs/proc/base.c 2010-03-30 18:23:50.000000000 +0200
@@ -430,12 +430,13 @@ static const struct file_operations proc
unsigned long badness(struct task_struct *p, unsigned long uptime);
static int proc_oom_score(struct task_struct *task, char *buffer)
{
- unsigned long points;
+ unsigned long points = 0;
struct timespec uptime;
do_posix_clock_monotonic_gettime(&uptime);
read_lock(&tasklist_lock);
- points = badness(task->group_leader, uptime.tv_sec);
+ if (pid_alive(task))
+ points = badness(task->group_leader, uptime.tv_sec);
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
--
->siglock is no longer needed to access task->signal, change
oom_adjust_read() and oom_adjust_write() to read/write oom_adj
lockless.
Yes, this means that "echo 2 >oom_adj" and "echo 1 >oom_adj"
can race and the second write can win, but I hope this is OK.
Also, cleanup the EACCES case a bit.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/proc/base.c | 28 ++++++----------------------
1 file changed, 6 insertions(+), 22 deletions(-)
--- 34-rc1/fs/proc/base.c~PROC_5_OOM_ADJ 2010-03-30 18:23:50.000000000 +0200
+++ 34-rc1/fs/proc/base.c 2010-03-30 19:14:43.000000000 +0200
@@ -981,22 +981,16 @@ static ssize_t oom_adjust_read(struct fi
{
struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
char buffer[PROC_NUMBUF];
+ int oom_adjust;
size_t len;
- int oom_adjust = OOM_DISABLE;
- unsigned long flags;
if (!task)
return -ESRCH;
- if (lock_task_sighand(task, &flags)) {
- oom_adjust = task->signal->oom_adj;
- unlock_task_sighand(task, &flags);
- }
-
+ oom_adjust = task->signal->oom_adj;
put_task_struct(task);
len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
-
return simple_read_from_buffer(buf, count, ppos, buffer, len);
}
@@ -1006,7 +1000,6 @@ static ssize_t oom_adjust_write(struct f
struct task_struct *task;
char buffer[PROC_NUMBUF];
long oom_adjust;
- unsigned long flags;
int err;
memset(buffer, 0, sizeof(buffer));
@@ -1025,20 +1018,11 @@ static ssize_t oom_adjust_write(struct f
task = get_proc_task(file->f_path.dentry->d_inode);
if (!task)
return -ESRCH;
- if (!lock_task_sighand(task, &flags)) {
- put_task_struct(task);
- return -ESRCH;
- }
-
- if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
- unlock_task_sighand(task, &flags);
- put_task_struct(task);
- return -EACCES;
- }
- task->signal->oom_adj = oom_adjust;
-
- unlock_task_sighand(task, &flags);
+ if (task->signal->oom_adj <= oom_adjust || ...Ok, but could you base this on -mm at http://userweb.kernel.org/~akpm/mmotm/ since an additional tunable has been added (oom_score_adj), which does the same thing? --
Ah, OK, will do. Thanks David. Oleg. --
David, I just can't understand why
oom-badness-heuristic-rewrite.patch
duplicates the related code in fs/proc/base.c and why it preserves
the deprecated signal->oom_adj.
OK. Please forget about lock_task_sighand/signal issues. Can't we kill
signal->oom_adj and create a single helper for both
/proc/pid/{oom_adj,oom_score_adj} ?
static ssize_t oom_any_adj_write(struct file *file, const char __user *buf,
size_t count, bool deprecated_mode)
{
struct task_struct *task;
char buffer[PROC_NUMBUF];
unsigned long flags;
long oom_score_adj;
int err;
memset(buffer, 0, sizeof(buffer));
if (count > sizeof(buffer) - 1)
count = sizeof(buffer) - 1;
if (copy_from_user(buffer, buf, count))
return -EFAULT;
err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
if (err)
return -EINVAL;
if (depraceted_mode) {
if (oom_score_adj == OOM_ADJUST_MAX)
oom_score_adj = OOM_SCORE_ADJ_MAX;
else
oom_score_adj = (oom_score_adj * OOM_SCORE_ADJ_MAX) /
-OOM_DISABLE;
}
if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
oom_score_adj > OOM_SCORE_ADJ_MAX)
return -EINVAL;
task = get_proc_task(file->f_path.dentry->d_inode);
if (!task)
return -ESRCH;
if (!lock_task_sighand(task, &flags)) {
put_task_struct(task);
return -ESRCH;
}
if (oom_score_adj < task->signal->oom_score_adj &&
!capable(CAP_SYS_RESOURCE)) {
unlock_task_sighand(task, &flags);
put_task_struct(task);
return -EACCES;
}
task->signal->oom_score_adj = oom_score_adj;
unlock_task_sighand(task, &flags);
put_task_struct(task);
return count;
}
This is just the current oom_score_adj_read() + "if (depraceted_mode)"
which does oom_adj -> oom_score_adj conversion.
Now,
static ssize_t oom_adjust_write(...)
{
printk_once(KERN_WARNING "... deprecated ...\n");
return oom_any_adj_write(..., true);
}
static ssize_t oom_score_adj_write(...)
{
return oom_any_adj_write(..., false);
}
The same for ...You could combine the two write functions together and then two read That doesn't work for depraceted_mode (sic), you'd need to test for There have been efforts to reuse as much of this code as possible for other sysctl handlers as well, you might be better off looking for other users of the common read and write code and then merging them first (comm_write, proc_coredump_filter_write, etc). --
Yes, probably "if (depraceted_mode)" should do more checks, I didn't try to verify that MIN/MAX are correctly converted. I showed this code to explain David, sorry ;) Right now I'd better try to stop the overloading of ->siglock. And, I'd like to shrink struct_signal if possible, but this is minor. Oleg. --
Ok, please cc me on the patch, it will be good to get rid of the duplicate
Do we need ->siglock? Why can't we just do
struct sighand_struct *sighand;
struct signal_struct *sig;
rcu_read_lock();
sighand = rcu_dereference(task->sighand);
if (!sighand) {
rcu_read_unlock();
return;
}
sig = task->signal;
... load/store to sig ...
rcu_read_unlock();
instead?
--
No. Before signals-make-task_struct-signal-immutable-refcountable.patch (actually, series of patches), this can't work. ->signal is not protected by rcu, and ->sighand != NULL doesn't mean ->signal != NULL. (yes, thread_group_cputime() is wrong too, but currently it is never called lockless). After signals-make-task_struct-signal-immutable-refcountable.patch, we do not need any checks at all, it is always safe to use ->signal. But. Unless we kill signal->oom_adj, we have another reason for ->siglock, we can't update both oom_adj and oom_score_adj atomically, and if we race with another thread they can be inconsistent wrt each other. Yes, oom_adj is not actually used, except we report it back to user-space, but still. So, I am going to send 2 patches. The first one factors out the code in base.c and kills signal->oom_adj, the next one removes ->siglock. Oleg. --
I think it would be better to just use task and not task->group_leader. --
Sure, agreed. I preserved ->group_leader just because I didn't understand why the current code doesn't use task. But note that pid_alive() is still needed. I'll check the code in -mm and resend. Oleg. --
Oh. No, with the current code in -mm pid_alive() is not needed if
we use task instead of task->group_leader. But once we fix
oom_forkbomb_penalty() it will be needed again.
But. Oh well. David, oom-badness-heuristic-rewrite.patch changed badness()
to consult p->signal->oom_score_adj. Until recently this was wrong when it
is called from proc_oom_score().
This means oom-badness-heuristic-rewrite.patch depends on
signals-make-task_struct-signal-immutable-refcountable.patch, or we
need the pid_alive() check again.
oom_badness() gets the new argument, long totalpages, and the callers
were updated. However, long uptime is not used any longer, probably
it make sense to kill this arg and simplify the callers? Unless you
are going to take run-time into account later.
So, I think -mm needs the patch below, but I have no idea how to
write the changelog ;)
Oleg.
--- x/fs/proc/base.c
+++ x/fs/proc/base.c
@@ -430,12 +430,13 @@ static const struct file_operations proc
/* The badness from the OOM killer */
static int proc_oom_score(struct task_struct *task, char *buffer)
{
- unsigned long points;
+ unsigned long points = 0;
struct timespec uptime;
do_posix_clock_monotonic_gettime(&uptime);
read_lock(&tasklist_lock);
- points = oom_badness(task->group_leader,
+ if (pid_alive(task))
+ points = oom_badness(task,
global_page_state(NR_INACTIVE_ANON) +
global_page_state(NR_ACTIVE_ANON) +
global_page_state(NR_INACTIVE_FILE) +
--
oom-badness-heuristic-rewrite.patch didn't change anything, Linus' tree currently dereferences p->signal->oom_adj which is no different from dereferencing p->signal->oom_score_adj without a refcount on the signal_struct in -mm. oom_adj was moved to struct signal_struct in This should be protected by the get_proc_task() on the inode before this function is called from proc_info_read(). --
Yes, I wrongly blaimed oom-badness-heuristic-rewrite.patch, vanilla does the same. Now this is really bad, and I am resending my patch. David, Andrew, I understand it (textually) conflicts with oom-badness-heuristic-rewrite.patch, but this bug should be fixed imho before other changes. I hope it will be easy to fixup this chunk @@ -447,7 +447,13 @@ static int proc_oom_score(struct task_st do_posix_clock_monotonic_gettime(&uptime); read_lock(&tasklist_lock); - points = badness(task->group_leader, uptime.tv_sec); + points = oom_badness(task->group_leader, No, get_proc_task() shouldn't (and can't) do this. To clarify, get_proc_task() does check the task wasn't unhashed, but nothing can prevent from release_task() after that. Once again, only task_struct itself is protected by get_task_struct(), nothing more. Oleg. --
proc_oom_score(task) have a reference to task_struct, but that is all.
If this task was already released before we take tasklist_lock
- we can't use task->group_leader, it points to nowhere
- it is not safe to call badness() even if this task is
->group_leader, has_intersects_mems_allowed() assumes
it is safe to iterate over ->thread_group list.
- even worse, badness() can hit ->signal == NULL
Add the pid_alive() check to ensure __unhash_process() was not called.
Also, use "task" instead of task->group_leader. badness() should return
the same result for any sub-thread. Currently this is not true, but
this should be changed anyway.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/proc/base.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- TTT/fs/proc/base.c~PROC_OOM_SCORE 2010-03-11 13:11:50.000000000 +0100
+++ TTT/fs/proc/base.c 2010-04-01 14:41:17.000000000 +0200
@@ -442,12 +442,13 @@ static const struct file_operations proc
unsigned long badness(struct task_struct *p, unsigned long uptime);
static int proc_oom_score(struct task_struct *task, char *buffer)
{
- unsigned long points;
+ unsigned long points = 0;
struct timespec uptime;
do_posix_clock_monotonic_gettime(&uptime);
read_lock(&tasklist_lock);
- points = badness(task->group_leader, uptime.tv_sec);
+ if (pid_alive(task))
+ points = badness(task, uptime.tv_sec);
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
--
Acked-by: David Rientjes <rientjes@google.com> Andrew, this is 2.6.34 material and should be backported to stable. It's not introduced by the recent oom killer rewrite pending in -mm, but it will require a trivial merge resolution on that work. --
I think this method is okay, but it's easy to trigger another bug of
oom. See select_bad_process():
if (!p->mm)
continue;
!p->mm is not always an unaccepted condition. e.g. "p" is killed and
doing exit, setting tsk->mm to NULL is before releasing the memory.
And in multi threading environment, this happens much more.
In __out_of_memory(), it panics if select_bad_process returns NULL.
The simple way to fix it is as mem_cgroup_out_of_memory() does.
So I think both of these 2 patches are needed.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index afeab2a..9aae208 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -588,12 +588,8 @@ retry:
if (PTR_ERR(p) == -1UL)
return;
- /* Found nothing?!?! Either we hang forever, or we panic. */
- if (!p) {
- read_unlock(&tasklist_lock);
- dump_header(NULL, gfp_mask, order, NULL);
- panic("Out of memory and no killable processes...\n");
- }
+ if (!p)
+ p = current;
if (oom_kill_process(p, gfp_mask, order, points, NULL,
--
This is fixed by oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch in the -mm tree. See The reason p wasn't selected is because it fails to meet the criteria for candidacy in select_bad_process(), not necessarily because of a race with the !p->mm check that the -mm patch cited above fixes. It's quite possible that current has an oom_adj value of OOM_DISABLE, for example, where this would be wrong. --
I see. And what about changing mem_cgroup_out_of_memory() too?
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0cb1ca4..9e89a29 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -510,8 +510,10 @@ retry:
if (PTR_ERR(p) == -1UL)
goto out;
- if (!p)
- p = current;
+ if (!p) {
+ read_unlock(&tasklist_lock);
+ panic("Out of memory and no killable processes...\n");
+ }
if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
"Memory cgroup out of memory"))
--
The memory controller is different because it must kill a task even if This actually does appear to be necessary but for a different reason: if current is unkillable because it has OOM_DISABLE, for example, then oom_kill_process() will repeatedly fail and mem_cgroup_out_of_memory() will infinitely loop. Kame-san? --
On Tue, 30 Mar 2010 13:29:29 -0700 (PDT) When a memcg goes into OOM and it only has unkillable processes (OOM_DISABLE), we can do nothing. (we can't panic because container's death != system death.) Because memcg itself has mutex+waitqueue for mutual execusion of OOM killer, I think infinite-loop will not be critical probelm for the whole system. And, now, memcg has oom-kill-disable + oom-kill-notifier features. So, If a memcg goes into OOM and there is no killable process, but oom-kill is not disabled by memcg.....it means system admin's mis-configuraton. He can stop inifite loop by hand, anyway. # echo 1 > ..../group_A/memory.oom_control Thanks, -Kame --
Then we should be able to do this since current is by definition unkillable since it was not found in select_bad_process(), right? --- diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) read_lock(&tasklist_lock); retry: p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL); - if (PTR_ERR(p) == -1UL) + if (!p || PTR_ERR(p) == -1UL) goto out; - if (!p) - p = current; - if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, "Memory cgroup out of memory")) goto retry; --
On Tue, 30 Mar 2010 23:07:08 -0700 (PDT) To me, this patch is acceptable and seems reasnoable. But I didn't joined to memcg development when this check was added and don't know why kill current.. http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c7ba5c9e81... Addinc Balbir to CC. Maybe situation is changed now. Because we can stop inifinite loop (by hand) and there is no rushing oom-kill callers, this change is acceptable. Thanks, --
The reason for adding current was that we did not want to loop forever, since it stops forward progress - no error/no forward progress. It made sense to oom kill the current process, so that the By hand is not always possible if we have a large number of cgroups (I've seen a setup with 2000 cgroups on libcgroup ML). 2000 cgroups * number of processes make the situation complex. I think using OOM notifier is now another way of handling such a situation. -- Three Cheers, Balbir --
oom_kill_process() will fail on current since it wasn't selected as an eligible task to kill in select_bad_process() and we know it to be a member of the memcg, so there's no point in trying to kill it. --
On Wed, 31 Mar 2010 12:00:07 +0530 "By hand" includes "automatically with daemon program", of course. Hmm, in short, your opinion is "killing current is good for now" ? I have no strong opinion, here. (Because I'll recommend all customers to disable oom kill if they don't want any task to be killed automatically.) Thanks, -Kame --
I think there're a couple of options: either define threshold notifiers with memory.usage_in_bytes so userspace can proactively address low memory situations prior to oom, or use the oom notifier after setting echo 1 > /dev/cgroup/blah/memory.oom_control to address those issues in userspace as they happen. If userspace wants to defer back to the kernel oom killer because it can't raise max_usage_in_bytes, then echo 0 > /dev/cgroup/blah/memory.oom_control should take care of it instantly and I'd rather see a misconfigured memcg with tasks that are OOM_DISABLE but not memcg->oom_kill_disable to be starved of memory than panicking the entire system. Those are good options for users having to deal with low memory situations, thanks for continuing to work on it! --
Seems reasonable. This will be checked on every major loop in the This is a lot less clear. GFP_NOFAIL is rare so this is basically saying that all threads with a fatal signal pending can ignore watermarks. This is dangerous because if 1000 threads get killed, there is a possibility of deadlocking the system. Why not obey the watermarks and just not retry the loop later and fail I'm ok with the should_alloc_retry() change but am a lot less ok with ignoring watermarks just because a fatal signal is pending and I think the nofail changes to __alloc_pages_slowpath() are unnecessary as should_alloc_retry() should end up failing the allocations. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
I don't quite understand the comment, this is only for __GFP_NOFAIL allocations, which you say are rare, so a large number of threads won't be The above check for (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) essentially oom kills p without invoking the oom killer before direct reclaim is invoked. We know it has a pending SIGKILL and wants to exit, so we allow it to allocate beyond the min watermark to avoid costly It is, but only after the oom killer is called. We don't want to needlessly kill another task here when p has already been killed but may not be PF_EXITING yet. --
Sorry, I typod. GFP_NOFAIL is rare but this is basically saying that all threads with a fatal signal and using NOFAIL can ignore watermarks. I don't think there is any caller in an exit path will be using GFP_NOFAIL as it's most common user is file-system related but it still feels unnecssary Fair point. How about just checking before __alloc_pages_may_oom() is called then? This check will be then in a slower path. I recognise this means that it is also only checked when direct reclaim is failing but there is at least one good reason for it. With this change, processes that have been sigkilled may now fail allocations that they might not have failed before. It would be difficult to trigger but here is one possible problem with this change; 1. System was borderline with some trashing 2. User starts program that gobbles up lots of memory on page faults, trashing the system further and annoying the user 3. User sends SIGKILL 4. Process was faulting and returns NULL because fatal signal was pending 5. Fault path returns VM_FAULT_OOM 6. Arch-specific path (on x86 anyway) calls out_of_memory again because VM_FAULT_OOM was returned. Ho hum, I haven't thought about this before but it's also possible that a process that is fauling that gets oom-killed will trigger a cascading OOM kill. If the system was heavily trashing, it might mean a large number of processes get killed. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Ok, that's reasonable. We already handle this case indirectly in -mm. oom-give-current-access-to-memory-reserves-if-it-has-been-killed.patch in -mm makes the oom killer set TIF_MEMDIE for current and return without killing any other task; it's unnecesary to check if !test_thread_flag(TIF_MEMDIE) before that since the oom killer will be a no-op anyway if there exist TIF_MEMDIE threads. The problem is that the should_alloc_retry() logic isn't checked when the oom killer is called, we immediately retry instead even if the oom killer didn't do anything. So if the oom killed task fails to exit because it's looping in the page allocator, that's going to happen forever since reclaim has failed and the oom killer can't kill anything else (or it's __GFP_NOFAIL and __alloc_pages_may_oom() will infinitely loop without ever returning). I guess this could potentially deplete memory reserves if too many threads have fatal signals and the oom killer is constantly invoked, regardless of __GFP_NOFAIL or not. That's why we have always opted to kill a memory hogging task instead via a tasklist scan: we want to set TIF_MEMDIE for as few tasks as possible with a large upside of memory freeing. I'm wondering if we should check should_alloc_retry() first, it seems like we could get rid of a few different branches in the oom killer path by Yeah, that's what oom-give-current-access-to-memory-reserves-if-it-has-been-killed.patch Pagefault ooms default to killing current first in -mm and only kill another task if current is unkillable for the architectures that use pagefault_out_of_memory(); the rest of the architectures such as powerpc just kill current. So while this scenario is plausible, I don't think there would be a large number of processes getting killed: pagefault_out_of_memory() will kill current and give it access to memory reserves and the oom killer won't perform any needless oom killing while that is happening. --
If B can exit, its memory will be freed, and A will be able to allocate the memory, so A won't loop here. Regards, --
Which memory? I thought, we are talking about the memory used by ->mm ? Oleg. --
There is also a little kernel struct related to the task can be freed, but I think you are correct, the memory used by ->mm takes more effect, and it won't be freed even B exits. So I agree you on: " the problem is not that B can't exit, the problem is that A doesn't know it should exit. All threads should exit and free ->mm. Even if B could exit, this is not enough. And, to some extent, it doesn't matter if it holds mmap_sem or not. " Thanks, --
I like the concept, but I agree that it would probably be better to write it as Oleg suggested. The oom killer has been rewritten in the -mm tree and so this patch doesn't apply cleanly, would it be possible to rebase to mmotm with the suggested coding sytle and post this again? See http://userweb.kernel.org/~akpm/mmotm/mmotm-readme.txt Thanks! --
