Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations

Previous thread: [PATCH] mtd: misplaced parenthesis in find_boot_record() by Roel Kluin on Monday, February 15, 2010 - 2:57 pm. (4 messages)

Next thread: [PATCH] dvb-usb/opera1: misplaced parenthesis by Roel Kluin on Monday, February 15, 2010 - 3:30 pm. (1 message)
From: David Rientjes
Date: Monday, February 15, 2010 - 3:19 pm

This patchset is a rewrite of the out of memory killer to address several
issues that have been raised recently.  The most notable change is a
complete rewrite of the badness heuristic that determines which task is
killed; the goal was to make it as simple and predictable as possible
while still addressing issues that plague the VM.

Changes from version 1:

 - updated to mmotm-2010-02-11-21-55

 - when iterating the tasklist for mempolicy-constrained oom conditions,
   the node of the cpu that a MPOL_F_LOCAL task is running on is now
   intersected with the page allocator's nodemask to determine whether it
   should be a candidate for oom kill.

 - added: [patch 4/9] oom: remove compulsory panic_on_oom mode

 - /proc/pid/oom_score_adj was added to prevent ABI breakage for
   applications using /proc/pid/oom_adj.  /proc/pid/oom_adj may still be
   used with the old range but it is then scaled to oom_score_adj units
   for a rough linear approximation.  There is no loss in functionality
   from the old interface.

 - added: [patch 6/9] oom: deprecate oom_adj tunable

This patchset is based on mmotm-2010-02-11-21-55 because of the following
dependencies:

	[patch 5/9] oom: badness heuristic rewrite:
		mm-count-swap-usage.patch

	[patch 7/9] oom: replace sysctls with quick mode:
		sysctl-clean-up-vm-related-variable-delcarations.patch

To apply to mainline, download 2.6.33-rc8 and apply

	mm-clean-up-mm_counter.patch
	mm-avoid-false-sharing-of-mm_counter.patch
	mm-avoid-false_sharing-of-mm_counter-checkpatch-fixes.patch
	mm-count-swap-usage.patch
	mm-count-swap-usage-checkpatch-fixes.patch
	mm-introduce-dump_page-and-print-symbolic-flag-names.patch
	sysctl-clean-up-vm-related-variable-declarations.patch
	sysctl-clean-up-vm-related-variable-declarations-fix.patch

from http://userweb.kernel.org/~akpm/mmotm/broken-out.tar.gz first.
---
 Documentation/feature-removal-schedule.txt |   30 +
 Documentation/filesystems/proc.txt         |  100 +++---
 ...
From: David Rientjes
Date: Monday, February 15, 2010 - 3:20 pm

When a task is chosen for oom kill, the oom killer first attempts to
sacrifice a child not sharing its parent's memory instead.
Unfortunately, this often kills in a seemingly random fashion based on
the ordering of the selected task's child list.  Additionally, it is not
guaranteed at all to free a large amount of memory that we need to
prevent additional oom killing in the very near future.

Instead, we now only attempt to sacrifice the worst child not sharing its
parent's memory, if one exists.  The worst child is indicated with the
highest badness() score.  This serves two advantages: we kill a
memory-hogging task more often, and we allow the configurable
/proc/pid/oom_adj value to be considered as a factor in which child to
kill.

Reviewers may observe that the previous implementation would iterate
through the children and attempt to kill each until one was successful
and then the parent if none were found while the new code simply kills
the most memory-hogging task or the parent.  Note that the only time
oom_kill_task() fails, however, is when a child does not have an mm or
has a /proc/pid/oom_adj of OOM_DISABLE.  badness() returns 0 for both
cases, so the final oom_kill_task() will always succeed.

Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   23 +++++++++++++++++------
 1 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -432,7 +432,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    unsigned long points, struct mem_cgroup *mem,
 			    const char *message)
 {
+	struct task_struct *victim = p;
 	struct task_struct *c;
+	unsigned long victim_points = 0;
+	struct timespec uptime;
 
 	if ...
From: Nick Piggin
Date: Monday, February 15, 2010 - 11:15 pm

Acked-by: Nick Piggin <npiggin@suse.de>
--

From: David Rientjes
Date: Monday, February 15, 2010 - 3:20 pm

If /proc/sys/vm/panic_on_oom is set to 2, the kernel will panic
regardless of whether the memory allocation is constrained by either a
mempolicy or cpuset.

Since mempolicy-constrained out of memory conditions now iterate through
the tasklist and select a task to kill, it is possible to panic the
machine if all tasks sharing the same mempolicy nodes (including those
with default policy, they may allocate anywhere) or cpuset mems have
/proc/pid/oom_adj values of OOM_DISABLE.  This is functionally equivalent
to the compulsory panic_on_oom setting of 2, so the mode is removed.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/sysctl/vm.txt |   20 ++++----------------
 mm/oom_kill.c               |    5 -----
 2 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -559,25 +559,13 @@ swap-intensive.
 
 panic_on_oom
 
-This enables or disables panic on out-of-memory feature.
+If this is set to zero, the oom killer will be invoked when the kernel is out of
+memory and direct reclaim cannot free any pages.  It will select a memory-
+hogging task that frees up a large amount of memory to kill.
 
-If this is set to 0, the kernel will kill some rogue process,
-called oom_killer.  Usually, oom_killer can kill rogue processes and
-system will survive.
-
-If this is set to 1, the kernel panics when out-of-memory happens.
-However, if a process limits using nodes by mempolicy/cpusets,
-and those nodes become memory exhaustion status, one process
-may be killed by oom-killer. No panic occurs in this case.
-Because other nodes' memory may be free. This means system total status
-may be not fatal yet.
-
-If this is set to 2, the kernel panics compulsorily even on the
-above-mentioned.
+If this is set to non-zero, the machine will panic when out of memory.
 
 The default value is 0.
-1 and 2 are for failover of clustering. ...
From: KAMEZAWA Hiroyuki
Date: Monday, February 15, 2010 - 5:00 pm

On Mon, 15 Feb 2010 14:20:09 -0800 (PST)

NACK. In an enviroment which depends on cluster-fail-over, this is useful
even if in such situation.

Thanks,
-Kame

--

From: David Rientjes
Date: Monday, February 15, 2010 - 5:14 pm

You don't understand that the behavior has changed ever since 
mempolicy-constrained oom conditions are now affected by a compulsory 
panic_on_oom mode, please see the patch description.  It's absolutely 
insane for a single sysctl mode to panic the machine anytime a cpuset or 
mempolicy runs out of memory and is more prone to user error from setting 
it without fully understanding the ramifications than any use it will ever 
do.  The kernel already provides a mechanism for doing this, OOM_DISABLE.  
if you want your cpuset or mempolicy to risk panicking the machine, set 
all tasks that share its mems or nodes, respectively, to OOM_DISABLE.  
This is no different from the memory controller being immune to such 
panic_on_oom conditions, stop believing that it is the only mechanism used 
in the kernel to do memory isolation.
--

From: KAMEZAWA Hiroyuki
Date: Monday, February 15, 2010 - 5:23 pm

On Mon, 15 Feb 2010 16:14:22 -0800 (PST)
You don't explain why "we _have to_ remove API which is used"

Thanks,
-Kame



--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 2:02 am

First, I'm not stating that we _have_ to remove anything, this is a patch 
proposal that is open for review.

Second, I believe we _should_ remove panic_on_oom == 2 because it's no 
longer being used as it was documented: as we've increased the exposure of 
the oom killer (memory controller, pagefault ooms, now mempolicy tasklist 
scanning), we constantly have to re-evaluate the semantics of this option 
while a well-understood tunable with a long history, OOM_DISABLE, already 
does the equivalent.  The downside of getting this wrong is that the 
machine panics when it shouldn't have because of an unintended consequence 
of the mode being enabled (a mempolicy ooms, for example, that was created 
by the user).  When reconsidering its semantics, I'd personally opt on the 
safe side and make sure the machine doesn't panic unnecessarily and 
instead require users to use OOM_DISABLE for tasks they do not want to be 
oom killed.
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, February 16, 2010 - 4:42 pm

On Tue, 16 Feb 2010 01:02:28 -0800 (PST)

Please don't. I had a chance to talk with customer support team and talked
about panic_on_oom briefly. I understood that panic_on_oom_alyways+kdump
is the strongest tool for investigating customer's OOM situtation and do
the best advice to them. panic_on_oom_always+kdump is the 100% information
as snapshot when oom-killer happens. Then, it's easy to investigate and
explain what is wront. They sometimes discover memory leak (by some prorietary
driver) or miss-configuration of the system (as using unnecessary bounce buffer.)

Then, please leave panic_on_oom=always.
Even with mempolicy or cpuset 's OOM, we need panic_on_oom=always option.
And yes, I'll add something similar to memcg. freeze_at_oom or something.

Thanks,
-Kame




--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 4:54 pm

Ok, I'm not looking to cause your customers unnecessary grief by removing 
an option that they use, even though the same effect is possible by 
setting all tasks to OOM_DISABLE.  I'll remove this patch in the next 

Memcg isn't a special case here, it should also panic the machine if 
panic_on_oom == 2, so if we aren't going to remove this option then I 
agree with Nick that we need to panic from mem_cgroup_out_of_memory() as 
well.  Some users use cpusets, for example, for the same effect of memory 
isolation as you use memcg, so panicking in one scenario and not the other 
is inconsistent.
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, February 16, 2010 - 5:01 pm

On Tue, 16 Feb 2010 15:54:50 -0800 (PST)
Hmm, I have a few reason to add special behavior to memcg rather than panic.

 - freeze_at_oom is enough.
   If OOM can be notified, the management daemon can do useful jobs. Shutdown
   all other cgroups or migrate them to other host and do kdump.

 - memcg's oom is not very complicated.
   Because we just counts RSS+FileCache

But, Hmm...I'd like to go this way.

 1. At first, support panic_on_oom=2 in memcg.

 2. Second, I'll add OOM-notifier and freeze_at_oom to memcg.
    and don't call memcg_out_of_memory in oom_kill.c in this case. Because
    we don't kill anything. Taking coredumps of all procs in memcg is not
    very difficult.

I need to discuss with memcg guys. But this will be a way to go, I think

Thanks,
-Kame

--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 5:31 pm

This should panic in mem_cgroup_out_of_memory() and the documentation 
should be added to Documentation/sysctl/vm.txt.

The memory controller also has some protection in the pagefault oom 
handler that seems like it could be made more general: instead of checking 
for mem_cgroup_oom_called(), I'd rather do a tasklist scan to check for 
already oom killed task (checking for the TIF_MEMDIE bit) and check all 
zones for ZONE_OOM_LOCKED.  If no oom killed tasks are found and no zones 
are locked, we can check sysctl_panic_on_oom and invoke the system-wide 

The oom notifier would be at a higher level than the oom killer, the oom 
killer's job is simply to kill a task when it is called.  So for these 
particular cases, you would never even call into out_of_memory() to panic 
the machine in the first place.  Hopefully, the oom notifier can be made 
to be more generic as its own cgroup rather than only being used by memcg, 
but if such a userspace notifier would defer to the kernel oom killer, it 
should panic when panic_on_oom == 2 is selected regardless of whether it 
is constrained or not.  Thus, we can keep the sysctl_panic_on_oom logic in 
the oom killer (both in out_of_memory() and mem_cgroup_out_of_memory()) 
without risk of unnecessarily panic whenever an oom notifier or 
freeze_at_oom setting intercepts the condition.
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, February 16, 2010 - 5:41 pm

On Tue, 16 Feb 2010 16:31:39 -0800 (PST)
cpuset's difficulty is that there are some methods which share the limitation.

It's not simple that we have
  - cpuset
  - mempolicy per task
  - mempolicy per vma

plz remove memcg's hook after doing that. Current implemantation is desgined 

That's my point. 

Thanks,
-Kame


--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 5:54 pm

Ok, I'll eliminate pagefault_out_of_memory() and get it to use 
out_of_memory() by only checking for constrained_alloc() when

Great, are you planning on implementing a cgroup that is based on roughly 
on the /dev/mem_notify patchset so userspace can poll() a file and be 
notified of oom events?  It would help beyond just memcg, it has an 
application to cpusets (adding more mems on large systems) as well.  It 
can also be used purely to preempt the kernel oom killer and move all the 
policy to userspace even though it would be sacrificing TIF_MEMDIE.
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, February 16, 2010 - 6:03 pm

On Tue, 16 Feb 2010 16:54:31 -0800 (PST)

I start from memcg because that gives us simple and clean, no heulistics
operation and we will not have ugly corner cases. And we can _expect_
that memcg has management daemon of OOM in other cgroup. Because memcg's memory
shortage never means "memory is exhausted", we can expect that daemon can work well.
Now, memcg has memory-usage-notifier file. oom-notifier will not be far differnet
from that.

cpuset should have its own if necessary. cpuset's difficulty is that
the memory on its nodes are _really_ exhausted and we're not sure
it can affecet management daemon at el...hang up.

BTW, concept of /dev/mem_notify is notify before OOM, not notify when OOM.
Now, memcg has memory-usage-notifier and that's implemented in some meaning.

Thanks,
-Kame

--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 6:58 pm

What do you think about making pagefaults use out_of_memory() directly and 
respecting the sysctl_panic_on_oom settings?

This removes the check for a parallel memcg oom killing since we can 
guarantee that's not going to happen if we take ZONE_OOM_LOCKED for all 
populated zones (nobody is currently executing the oom killer) and no 
tasks have TIF_MEMDIE set.

Signed-off-by: David Rientjes <rientjes@google.com>
---
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -124,7 +124,6 @@ static inline bool mem_cgroup_disabled(void)
 	return false;
 }
 
-extern bool mem_cgroup_oom_called(struct task_struct *task);
 void mem_cgroup_update_file_mapped(struct page *page, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask, int nid,
@@ -258,11 +257,6 @@ static inline bool mem_cgroup_disabled(void)
 	return true;
 }
 
-static inline bool mem_cgroup_oom_called(struct task_struct *task)
-{
-	return false;
-}
-
 static inline int
 mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -200,7 +200,6 @@ struct mem_cgroup {
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
 	bool use_hierarchy;
-	unsigned long	last_oom_jiffies;
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
@@ -1234,34 +1233,6 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 	return total;
 }
 
-bool mem_cgroup_oom_called(struct task_struct *task)
-{
-	bool ret = false;
-	struct mem_cgroup *mem;
-	struct mm_struct *mm;
-
-	rcu_read_lock();
-	mm = task->mm;
-	if (!mm)
-		mm = &init_mm;
-	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
-	if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
-		ret = true;
-	rcu_read_unlock();
-	return ret;
-}
-
-static int record_last_oom_cb(struct ...
From: KAMEZAWA Hiroyuki
Date: Tuesday, February 16, 2010 - 7:13 pm

On Tue, 16 Feb 2010 17:58:05 -0800 (PST)

I don't think this patch is good. Because several memcg can
cause oom at the same time independently, system-wide oom locking is
unsuitable. BTW, what I doubt is much more fundamental thing.

What I doubt at most is "why VM_FAULT_OOM is necessary ? or why we have
to call oom_killer when page fault returns it".
Is there someone who returns VM_FAULT_OOM without calling page allocator
and oom-killer helps something in such situation ?

If returning VM_FAULT_OOM without caliing usual page allocator, oom-killer
will be never help, I guess.

If we don't have that, we don't have to implement pagefault_out_of_memory.

Hmm ?

Thanks,

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, February 16, 2010 - 7:23 pm

On Wed, 17 Feb 2010 11:13:19 +0900

And basically. memcg's oom means "the usage over the limits!!" and does
never means "resouce is exhausted!!".

Then, marking OOM to zones sounds strange. You can cause oom in 64MB memcg
in 64GB system.

Thanks,
-Kame

--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 7:37 pm

ZONE_OOM_LOCKED is taken system-wide because the result of a memcg oom is 
that a task will get killed and free memory, so VM_FAULT_OOM doesn't 
require any additional killing if we're oom, it should just retry after 
the task has exited.  If we remove the zone locking for memcg, it is 
possible that pagefaults will race with setting TIF_MEMDIE and two tasks 
get killed instead.  I guess that's acceptable considering its just as 
likely that the memcg will reallocate to the same limit again and cause 
VM_FAULT_OOM to rekill.
--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 7:28 pm

We want to lock all populated zones with ZONE_OOM_LOCKED to avoid 
needlessly killing more than one task regardless of how many memcgs are 

Before we invoked the oom killer for VM_FAULT_OOM, we simply sent a 
SIGKILL to current because we simply don't have memory to fault the page 
in, it's better to select a memory-hogging task to kill based on badness() 
than to constantly kill current which may not help in the long term.
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, February 16, 2010 - 7:34 pm

On Tue, 16 Feb 2010 18:28:05 -0800 (PST)
What I mean is
 - What VM_FAULT_OOM means is not "memory is exhausted" but "something is exhausted".

For example, when hugepages are all used, it may return VM_FAULT_OOM.
Especially when nr_overcommit_hugepage == usage_of_hugepage, it returns VM_FAULT_OOM.

Then, what oom-killer can help it ? I think never and the requester should die.

Before modifying current code, I think we have to check all VM_FAULT_OOM and distinguish
 - memory is exhausted (and page allocater wasn't called.)
 - something other than memory is exhausted.

And, in hugepage case, even order > PAGE_ALLOC_COSTLY_ORDER, oom-killer is
called and pagegault_oom_kill kills tasks randomly.

Thanks,
-Kame

--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 7:58 pm

I've updated my patch to not take ZONE_OOM_LOCKED for any zones on memcg 
oom.  I'm hoping that you will add sysctl_panic_on_oom == 2 for this case 

The hugetlb case seems to be the only misuse of VM_FAULT_OOM where it 
doesn't mean we simply don't have the memory to handle the page fault, 
i.e. your earlier "memory is exhausted" definition.  That was handled well 
before calling out_of_memory() by simply killing current since we know it 
is faulting hugetlb pages and its resource is limited.

We could pass the vma to pagefault_out_of_memory() and simply kill current 
if its killable and is_vm_hugetlb_page(vma).
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, February 16, 2010 - 8:21 pm

On Tue, 16 Feb 2010 18:58:17 -0800 (PST)

No. hugepage is not only case.
You may not read but we annoyed i915's driver bug recently and it was clearly
misuse of VM_FAULT_OOM. Then, we got many reports of OOM killer in these months.
(thanks to Kosaki about this.)

quick glance around core codes...
 - HUGEPAGE at el. should return some VM_FAULT_NO_RESOUECE rather than VM_FAULT_OOM.
 - filemap.c's VM_FAULT_OOM shoudn't call page_fault_oom_kill because it has already
   called oom_killer if it can. 
 - about relayfs, is VM_FAULT_OOM should be BUG_ON()...
 - filemap_xip.c return VM_FAULT_OOM....but it doesn't seem to be OOM..
   just like VM_FAULT_NO_VALID_PAGE_FOUND. (But I'm not familiar with this area.)
 - fs/buffer.c 's VM_FAULT_OOM is returned oom-killer is called.
 - shmem.c's VM_FAULT_OOM is retuned oom-killer is called.

i915's VM_FAULT_OOM is miterious but I can't find whether its real OOM or just shortage
of is own resource. I think VM_FAULT_NO_RESOUCE should be added.


Thanks,
-Kame


--

From: David Rientjes
Date: Wednesday, February 17, 2010 - 2:11 am

We can detect this with is_vm_hugetlb_page() if we pass the vma into 



The filemap, shmem, and block_prepare_write() cases will call the oom 
killer but, depending on the gfp mask, they will retry their allocations 
after the oom killer is called so we should never return VM_FAULT_OOM 
because they return -ENOMEM.  They fail from either small objsize slab 
allocations or with orders less than PAGE_ALLOC_COSTLY_ORDER which by 
default continues to retry even if direct reclaim fails.  If we're 
returning with VM_FAULT_OOM from these handlers, it should only be because 
of GFP_NOFS | __GFP_NORETRY or current has been oom killed and still can't 
find memory (so we don't care if the oom killer is called again since it 
won't kill anything else).

So like I said, I don't really see a need where VM_FAULT_NO_RESOURCE would 
be helpful in any case other than hugetlb which we can already detect by 
passing the vma into the pagefault oom handler.
--

From: Nick Piggin
Date: Wednesday, February 17, 2010 - 2:52 am

The real question is, what to do when returning to userspace. I don't
think there's a lot of options. SIGBUS is traditionally used for "no

SIGBUS, probably, yes. I questioned this as well but it mustn't

Yep. And yes you are right that we prefer to do the oom killing at the
allocation point where we know all the context, however the fact is that
VM_FAULT_OOM is an allowed part of the fault API so we have to handle it
somehow.

It can theoretically be called for valid reasons say if a driver or
arch page table has a high order allocation, or if the page allocator
implementation were to be changed.

We can't rightly just kill the task at this point, even if it has
invoked the oom killer, because it could have been marked as unkillable.

--

From: David Rientjes
Date: Wednesday, February 17, 2010 - 3:04 pm

For is_vm_hugetlb_page() in the pagefault oom handler, I think it should 
default to killing current as we did previously until that's worked out 
(and as some architectures like ia64 and powerpc still do).  In fact, 
pagefault ooms should probably always default to killing current if its 

That's easy to test in the oom handler, we can default to killing current 
but then kill another task if it is unkillable:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -696,15 +696,23 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 }
 
 /*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel
- * oom killing is already in progress so do nothing.  If a task is found with
- * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
+ * The pagefault handler calls here because it is out of memory, so kill current
+ * by default.  If it's unkillable, then fallback to killing a memory-hogging
+ * task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel oom killing is
+ * already in progress so do nothing.  If a task is found with TIF_MEMDIE set,
+ * it has been killed so do nothing and allow it to exit.
  */
 void pagefault_out_of_memory(void)
 {
+	unsigned long totalpages;
+	int err;
+
 	if (!try_set_system_oom())
 		return;
-	out_of_memory(NULL, 0, 0, NULL);
+	constrained_alloc(NULL, 0, NULL, &totalpages);
+	err = oom_kill_process(current, 0, 0, 0, totalpages, NULL,
+				"Out of memory (pagefault)"))
+	if (err)
+		out_of_memory(NULL, 0, 0, NULL);
 	clear_system_oom();
 }

We'll need to convert the architectures that still only issue a SIGKILL to 
current to use pagefault_out_of_memory() before OOM_DISABLE is fully 
respected across the kernel, though.
--

From: Daisuke Nishimura
Date: Sunday, February 21, 2010 - 10:31 pm

Hi.

It might be a bit off-topic, but memcg's check for last_oom_jiffies seems
not to work well under heavy load, and pagefault_out_of_memory() causes
global oom.

Step.1 make a memory cgroup directory and sed memory.limit_in_bytes to a small value

  > mkdir /cgroup/memory/test
  > echo 1M >/cgroup/memory/test/memory.limit_in_bytes

Stem.2 run attached test program(which allocates memory and does fork recursively)

  > ./recursive_fork -c 8 -s `expr 1 \* 1024 \* 1024`

This causes not only memcg's oom, but also global oom(My machine has 8 CPUS).

===
[348090.121808] recursive_fork3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
[348090.121821] recursive_fork3 cpuset=/ mems_allowed=0
[348090.121829] Pid: 22744, comm: recursive_fork3 Not tainted 2.6.32.8-00001-gb6cd517 #3
[348090.121832] Call Trace:
[348090.121849]  [<ffffffff810d6015>] oom_kill_process+0x86/0x295
[348090.121855]  [<ffffffff810d64cf>] ? select_bad_process+0x63/0xf0
[348090.121861]  [<ffffffff810d687a>] mem_cgroup_out_of_memory+0x69/0x87
[348090.121870]  [<ffffffff811119c2>] __mem_cgroup_try_charge+0x15f/0x1d4
[348090.121876]  [<ffffffff811126bc>] mem_cgroup_try_charge_swapin+0x104/0x159
[348090.121885]  [<ffffffff810edd9b>] handle_mm_fault+0x4ca/0x76c
[348090.121895]  [<ffffffff8143419f>] ? do_page_fault+0x141/0x2da
[348090.121904]  [<ffffffff81087286>] ? trace_hardirqs_on+0xd/0xf
[348090.121910]  [<ffffffff8143419f>] ? do_page_fault+0x141/0x2da
[348090.121915]  [<ffffffff8143431c>] do_page_fault+0x2be/0x2da
[348090.121922]  [<ffffffff81432115>] page_fault+0x25/0x30
[348090.121929] Task in /test killed as a result of limit of /test
[348090.121936] memory: usage 1024kB, limit 1024kB, failcnt 279335
[348090.121940] memory+swap: usage 4260kB, limit 9007199254740991kB, failcnt 0
[348090.121943] Mem-Info:
[348090.121947] Node 0 DMA per-cpu:
[348090.121952] CPU    0: hi:    0, btch:   1 usd:   0
[348090.121956] CPU    1: hi:    0, btch:   1 usd:   0
[348090.121960] CPU    2: hi:    0, btch:   1 usd:   ...
From: KAMEZAWA Hiroyuki
Date: Sunday, February 21, 2010 - 11:15 pm

On Mon, 22 Feb 2010 14:31:51 +0900

Yes, current design is not the best thing, my bad.
(I had to band-aid against unexpected panic in pagefault_out_of_memory.)

But tweaking that vaule seems not promissing.

Essential fix is better. The best fix is don't call oom-killer in
pagefault_out_of_memory. So, returning other than VM_FAULT_OOM is
the best, I think. But hmm...we don't have VM_FAULT_AGAIN etc..
So, please avoid quick fix. 

One thing I can think of is sleep-and-retry in try_charge() if PF_MEMDIE
is not set. (But..By this, memcg will never return faiulre in page fault.
but it may sound reasonable.)


Thanks,

--

From: Daisuke Nishimura
Date: Monday, February 22, 2010 - 4:42 am

On Mon, 22 Feb 2010 15:15:13 +0900
hmm, I can agree with you. But I think we need some trick to distinguish normal VM_FAULT_OOM
and memcg's VM_FAULT_OOM(the current itself was killed by memcg's oom, so exited the retry)
at mem_cgroup_oom_called() to avoid the system from panic when panic_on_oom is enabled.
(Mark the task which is being killed by memcg's oom ?).


Thanks,
Daisuke Nishimura.
--

From: David Rientjes
Date: Monday, February 22, 2010 - 1:59 pm

pagefault_out_of_memory() should use mem_cgroup_from_task(current) and 
then call mem_cgroup_out_of_memory() when it's non-NULL.  
select_bad_process() will return ERR_PTR(-1UL) if there is an already oom 
killed task attached to the memcg, so we can use that to avoid the 
panic_on_oom.  The setting of that sysctl doesn't imply that we can't scan 
the tasklist, it simply means we can't kill anything as a result of an 
oom.
--

From: KAMEZAWA Hiroyuki
Date: Monday, February 22, 2010 - 4:51 pm

On Mon, 22 Feb 2010 20:42:37 +0900

I'll prepare a patch today in other thread.

Thanks,
-Kame

--

From: David Rientjes
Date: Monday, February 22, 2010 - 1:55 pm

The last patch in my oom killer series defaults pagefault_out_of_memory() 
to always kill current first, if it's killable.  If it is unsuccessful, we 
fallback to scanning the entire tasklist.

For tasks that are constrained by a memcg, we could probably use 
mem_cgroup_from_task(current) and if it's non-NULL and non-root, call 
mem_cgroup_out_of_memory() with a gfp_mask of 0.  That would at least 
penalize the same memcg instead of invoking a global oom and would try the 
additional logic that you plan on adding to avoid killing any task at all 
in such conditions.
--

From: KOSAKI Motohiro
Date: Tuesday, February 16, 2010 - 7:19 pm

At least, I agree pagefault oom part. it need ZONE_OOM_LOCKED too.
if you make separated patch, I'll ack it. I don't know memcg part is 
correct or not.



--

From: Nick Piggin
Date: Monday, February 15, 2010 - 11:20 pm

What is the point of removing it, though? If it doesn't significantly
help some future patch, just leave it in. It's not worth breaking the
user/kernel interface just to remove 3 trivial lines of code.

--

From: David Rientjes
Date: Monday, February 15, 2010 - 11:59 pm

Because it is inconsistent at the user's expense, it has never panicked 
the machine for memory controller ooms, so why is a cpuset or mempolicy 
constrained oom conditions any different?  It also panics the machine even 
on VM_FAULT_OOM which is ridiculous, the tunable is certainly not being 
used how it was documented and so given the fact that mempolicy 
constrained ooms are now much smarter with my rewrite and we never simply 
kill current unless oom_kill_quick is enabled anymore, the compulsory 
panic_on_oom == 2 mode is no longer required.  Simply set all tasks 
attached to a cpuset or bound to a specific mempolicy to be OOM_DISABLE, 
the kernel need not provide confusing alternative modes to sysctls for 
this behavior.  Before panic_on_oom == 2 was introduced, it would have 
only panicked the machine if panic_on_oom was set to a non-zero integer, 
defining it be something different for '2' after it has held the same 
semantics for years is inappropriate.  There is just no concrete example 
that anyone can give where they want a cpuset-constrained oom to panic the 
machine when other tasks on a disjoint set of mems can continue to do 
work and the cpuset of interest cannot have its tasks set to OOM_DISABLE.
--

From: Nick Piggin
Date: Tuesday, February 16, 2010 - 12:20 am

Well memory controller was added later, wasn't it? So if you think
that's a bug then a fix to panic on memory controller ooms might



Well it was always defined in the documentation that it should be 0
or 1. Just that the limit wasn't enforced. I agree that's not ideal,
but anyway the existing and documented 0/1/2 has been there for 3 years

But this is changing the way the environment is required to set up. So
a kernel upgrade can break previously working setups. We don't do this
without really good reason.

--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 12:53 am

But what about the existing memcg users who set panic_on_oom == 2 and 

Because the oom killer was never called for VM_FAULT_OOM before, we simply 
sent a SIGKILL to current, i.e. the original panic_on_oom semantics were 

It was meant to panic the machine anytime it was out of memory, regardless 
of the constraint, but that obviously doesn't match the memory controller 
case.  Just because cpusets and mempolicies decide to use the oom killer 
as a mechanism for enforcing a user-defined policy does not mean that we 
want to panic for them: mempolicies, for example, are user created and do 
not require any special capability.  Does it seem reasonable that an oom 
condition on those mempolicy nodes should panic the machine when killing 
the offender is possible (and perhaps even encouraged if the user sets a 
high /proc/pid/oom_score_adj?)  In other words, is an admin setting 
panic_on_oom == 2 really expecting that no application will use 
set_mempolicy() or do an mbind()?  This is a very error-prone interface 
that needs to be dealt with on a case-by-case basis and the perfect way to 
do that is by setting the affected tasks to be OOM_DISABLE; that 
interface, unlike panic_on_oom == 2, is very well understood by those with 
CAP_SYS_RESOURCE.
--

From: Nick Piggin
Date: Tuesday, February 16, 2010 - 1:08 am

But that was a bug in the addition of the memory controller. Either the

No but now they are. I don't know what your point is here because there
is no way the users of this interface can be expected to know about
VM_FAULT_OOM versus pagefault_out_of_memory let alone do anything useful

Right, and it's been like that for 3 years and people who don't use
the memory controller will be using that tunable.


I assume it is reasonable to want to panic on any OOM if you're after
fail-stop kind of behaviour. I guess that is why it was added. I see
more use for that case than panic_on_oom==1 case myself.

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, February 16, 2010 - 1:10 am

On Tue, 16 Feb 2010 19:08:17 +1100
I'll add a documentation to memcg. As

"When you exhaust memory resource under memcg, oom-killer may be invoked.
 But in this case, the system never panics even when panic_on_oom is set."

Maybe I should add "memcg_oom_notify (netlink message or file-decriptor or some".
Because memcg's oom is virtual oom, automatic management software can show
report to users and can do fail-over. I'll consider something useful for
memcg oom-fail-over instead of panic. In the simplest case, cgroup's notiifer
file descriptor can be used.

Thanks,
-Kame

--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 1:42 am

The memory controller behavior seems intentional because it prevents 
panicking in two places: mem_cgroup_out_of_memory() never considers it and 
sysctl_panic_on_oom is preempted in pagefault_out_of_memory() if current's 
memcg is oom.

The documentation is currently right because it only mentions an 
application to cpusets and mempolicies.

That's the reason why I think we should eliminate it: it is completely 
bogus as it stands because it allows tasks to be killed in memory 
controller environments if their hard limit is reached unless they are set 
to OOM_DISABLE.  That doesn't have fail-stop behavior and trying to make 
exceptions to the rule is not true "fail-stop" that we need to preserve 

I think VM_FAULT_OOM should panic the machine for panic_on_oom == 1 as it 
presently does, it needs no special handling otherwise.  But this is an 
example of where semantics of panic_on_oom have changed in the past where 
OOM_DISABLE would remove any ambiguity.  Instead of redefining the 
sysctl's semantics everytime we add another usecase for the oom killer, 
why can't we just use a single interface that has been around for years 

I doubt you'll find much support from the memory controller folks on that 
since they probably won't agree this is fail-stop behavior and killing a 
task when constrained by a memcg is appropriate because the user asked for 
a hard limit.

Again, OOM_DISABLE would remove all ambiguity and we wouldn't need to 
concern ourselves of what the semantics of a poorly chosen interface such 

panic_on_oom == 1 is reasonable since no system task can make forward 
progress in allocating memory, that isn't necessarily true of cpuset or 
mempolicy (or memcg) constrained applications.  Other cpusets, for 
instance, can continue to do work uninterrupted and without threat of 
having one of their tasks being oom killed.
--

From: David Rientjes
Date: Monday, February 15, 2010 - 3:20 pm

This a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions.  The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

The baseline for the heuristic is a proportion of memory that each task
is currently using in memory plus swap compared to the amount of
"allowable" memory.  "Allowable," in this sense, means the system-wide
resources for unconstrained oom conditions, the set of mempolicy nodes,
the mems attached to current's cpuset, or a memory controller's limit.
The proportion is given on a scale of 0 (never kill) to 1000 (always
kill), roughly meaning that if a task has a badness() score of 500 that
the task consumes approximately 50% of allowable memory resident in RAM
or in swap space.

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets
may operate in isolation; they shall not need to know the true size of
the machine on which they are running if they are bound to a specific set
of nodes or mems, respectively.

Forkbomb detection is done in a completely different way: a threshold is
configurable from userspace to determine how many first-generation execve
children (those with their own address spaces) a task may have before it
is considered a forkbomb.  This can be tuned by altering the value in
/proc/sys/vm/oom_forkbomb_thres, which defaults to 1000.

When a task has more than 1000 first-generation children with different
address spaces than itself, a penalty of

	(average rss of children) * (# of 1st generation execve children)
	-----------------------------------------------------------------
			oom_forkbomb_thres

is assessed.  So, for example, using the default oom_forkbomb_thres of
1000, the penalty is twice the average rss of all its ...
From: David Rientjes
Date: Monday, February 15, 2010 - 3:20 pm

Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
implemented for very large systems to avoid excessively long tasklist
scans.  The former suppresses helpful diagnostic messages that are
emitted for each thread group leader that are candidates for oom kill
including their pid, uid, vm size, rss, oom_adj value, and name; this
information is very helpful to users in understanding why a particular
task was chosen for kill over others.  The latter simply kills current,
the task triggering the oom condition, instead of iterating through the
tasklist looking for the worst offender.

Both of these sysctls are combined into one for use on the aforementioned
large systems: oom_kill_quick.  This disables the now-default
oom_dump_tasks and kills current whenever the oom killer is called.

The oom killer rewrite is the perfect opportunity to combine both sysctls
into one instead of carrying around the others for years to come for
nothing else than legacy purposes.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/sysctl/vm.txt |   44 +++++-------------------------------------
 include/linux/oom.h         |    3 +-
 kernel/sysctl.c             |   13 ++---------
 mm/oom_kill.c               |    9 +++----
 4 files changed, 14 insertions(+), 55 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -43,9 +43,8 @@ Currently, these files are in /proc/sys/vm:
 - nr_pdflush_threads
 - nr_trim_pages         (only if CONFIG_MMU=n)
 - numa_zonelist_order
-- oom_dump_tasks
 - oom_forkbomb_thres
-- oom_kill_allocating_task
+- oom_kill_quick
 - overcommit_memory
 - overcommit_ratio
 - page-cluster
@@ -470,27 +469,6 @@ this is causing problems for your system/application.
 
 ==============================================================
 
-oom_dump_tasks
-
-Enables a system-wide task dump ...
From: Nick Piggin
Date: Monday, February 15, 2010 - 11:28 pm

I just don't understand this either. There appears to be simply no
--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 1:58 am

When oom_dump_tasks() is always emitted for out of memory conditions as my 
patch does, then these two tunables have the exact same audience: users 
with large systems that have extremely long tasklists.  They want to avoid 
tasklist scanning (either to select a bad process to kill or dump their 
information) in oom conditions and simply kill the allocating task.  I 
chose to combine the two: we're not concerned about breaking the 
oom_dump_tasks ABI since it's now the default behavior and since we scan 
the tasklist for mempolicy-constrained ooms, users may now choose to 
enable oom_kill_allocating_task when they previously wouldn't have.  To do 
that, they can either use the old sysctl or convert to this new sysctl 
with the benefit that we've removed one unnecessary sysctl from 
/proc/sys/vm.

As far as I know, oom_kill_allocating_task is only used by SGI, anyway, 
since they are the ones who asked for it when I implemented cpuset 
tasklist scanning.  It's certainly not widely used and since the semantics 
for mempolicies have changed, oom_kill_quick may find more users.
--

From: David Rientjes
Date: Monday, February 15, 2010 - 3:20 pm

Remove the redundancy in __oom_kill_task() since:

 - init can never be passed to this function: it will never be PF_EXITING
   or selectable from select_bad_process(), and

 - it will never be passed a task from oom_kill_task() without an ->mm
   and we're unconcerned about detachment from exiting tasks, there's no
   reason to protect them against SIGKILL or access to memory reserves.

Also moves the kernel log message to a higher level since the verbosity
is not always emitted here; we need not print an error message if an
exiting task is given a longer timeslice.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   64 ++++++++++++++------------------------------------------
 1 files changed, 16 insertions(+), 48 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -405,67 +405,35 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(mem);
 }
 
-#define K(x) ((x) << (PAGE_SHIFT-10))
-
 /*
- * Send SIGKILL to the selected  process irrespective of  CAP_SYS_RAW_IO
- * flag though it's unlikely that  we select a process with CAP_SYS_RAW_IO
- * set.
+ * Give the oom killed task high priority and access to memory reserves so that
+ * it may quickly exit and free its memory.
  */
-static void __oom_kill_task(struct task_struct *p, int verbose)
+static void __oom_kill_task(struct task_struct *p)
 {
-	if (is_global_init(p)) {
-		WARN_ON(1);
-		printk(KERN_WARNING "tried to kill init!\n");
-		return;
-	}
-
-	task_lock(p);
-	if (!p->mm) {
-		WARN_ON(1);
-		printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
-			task_pid_nr(p), p->comm);
-		task_unlock(p);
-		return;
-	}
-
-	if (verbose)
-		printk(KERN_ERR "Killed process %d (%s) "
-		       "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
-		       task_pid_nr(p), ...
From: David Rientjes
Date: Monday, February 15, 2010 - 3:20 pm

If memory has been depleted in lowmem zones even with the protection
afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
killing current users will help.  The memory is either reclaimable (or
migratable) already, in which case we should not invoke the oom killer at
all, or it is pinned by an application for I/O.  Killing such an
application may leave the hardware in an unspecified state and there is
no guarantee that it will be able to make a timely exit.

Lowmem allocations are now failed in oom conditions so that the task can
perhaps recover or try again later.  Killing current is an unnecessary
result for simply making a GFP_DMA or GFP_DMA32 page allocation and no
lowmem allocations use the now-deprecated __GFP_NOFAIL bit so retrying is
unnecessary.

Previously, the heuristic provided some protection for those tasks with 
CAP_SYS_RAWIO, but this is no longer necessary since we will not be
killing tasks for the purposes of ISA allocations.

high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
default for all allocations that are not __GFP_DMA, __GFP_DMA32,
__GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
return true for allocations that have either __GFP_DMA or __GFP_DMA32.

Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1914,6 +1914,9 @@ rebalance:
 	 * running out of options and have to consider going OOM
 	 */
 	if (!did_some_progress) {
+		/* The oom killer won't necessarily free lowmem */
+		if (high_zoneidx < ZONE_NORMAL)
+			goto nopage;
 		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
 			if (oom_killer_disabled)
 				goto ...
From: KAMEZAWA Hiroyuki
Date: Monday, February 15, 2010 - 4:57 pm

On Mon, 15 Feb 2010 14:20:21 -0800 (PST)

WARN_ON((high_zoneidx < ZONE_NORMAL) && (gfp_mask & __GFP_NOFAIL))
plz.

Thanks,
-Kame

--

From: David Rientjes
Date: Monday, February 15, 2010 - 5:10 pm

As I already explained when you first brought this up, the possibility of 
not invoking the oom killer is not unique to GFP_DMA, it is also possible 
for GFP_NOFS.  Since __GFP_NOFAIL is deprecated and there are no current 
users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.  
We're not adding any additional __GFP_NOFAIL allocations.
--

From: KAMEZAWA Hiroyuki
Date: Monday, February 15, 2010 - 5:21 pm

On Mon, 15 Feb 2010 16:10:15 -0800 (PST)

Please add documentation about that to gfp.h before doing this.
Doing this without writing any documenation is laziness.
(WARNING is a style of documentation.)

Thanks,
-Kame


--

From: David Rientjes
Date: Monday, February 15, 2010 - 6:13 pm

This is already documented in the page allocator, but I guess doing it in 
include/linux/gfp.h as well doesn't hurt.



mm: add comment about deprecation of __GFP_NOFAIL

__GFP_NOFAIL was deprecated in dab48dab, so add a comment that no new 
users should be added.

Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/gfp.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -30,7 +30,8 @@ struct vm_area_struct;
  * _might_ fail.  This depends upon the particular VM implementation.
  *
  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- * cannot handle allocation failures.
+ * cannot handle allocation failures.  This modifier is deprecated and no new
+ * users should be added.
  *
  * __GFP_NORETRY: The VM implementation must not retry indefinitely.
  *
--

From: KAMEZAWA Hiroyuki
Date: Monday, February 15, 2010 - 6:26 pm

On Mon, 15 Feb 2010 17:13:57 -0800 (PST)
I want warning when someone uses OBSOLETE interface but...

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

I hope no 3rd vendor (proprietary) driver uses __GFP_NOFAIL, they tend to

--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 12:03 am

I hope they don't use it with GFP_ATOMIC, either, because it's never been 
respected in that context.  We can easily audit the handful of cases in 
the kernel that use __GFP_NOFAIL (it takes five minutes at the max) and 
prove that none use it with GFP_ATOMIC or GFP_NOFS.  We don't need to add 
multitudes of warnings about using a deprecated flag with ludicrous 
combinations (does anyone really expect GFP_ATOMIC | __GFP_NOFAIL to work 
gracefully)?
--

From: Nick Piggin
Date: Tuesday, February 16, 2010 - 12:23 am

You don't need to add warnings, just don't break existing working
combinations and nobody has anything to complain about.

--

From: KOSAKI Motohiro
Date: Monday, February 15, 2010 - 10:32 pm

No current user? I don't think so.

	int bio_integrity_prep(struct bio *bio)
	{
	(snip)
	        buf = kmalloc(len, GFP_NOIO | __GFP_NOFAIL | q->bounce_gfp);

and 

	void blk_queue_bounce_limit(struct request_queue *q, u64 dma_mask)
	{
	(snip)
	        if (dma) {
	                init_emergency_isa_pool();
	                q->bounce_gfp = GFP_NOIO | GFP_DMA;
	                q->limits.bounce_pfn = b_pfn;
	        }



I don't like rumor based discussion, I like fact based one.

Thanks.



--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 12:29 am

The GFP_NOIO will prevent the oom killer from being called, it requires 
__GFP_FS.

I can change this to invoke the should_alloc_retry() logic by testing for 
!(gfp_mask & __GFP_NOFAIL), but there's nothing else the page allocator 
can currently do to increase its probability of allocating pages; the 
memory compaction patchset might be particularly helpful for these types 
of scenarios.
--

From: Nick Piggin
Date: Monday, February 15, 2010 - 11:44 pm

Completely agree with this request. Actually, I think even better you
should just add && !(gfp_mask & __GFP_NOFAIL). Deprecated doesn't mean
it is OK to break the API (callers *will* oops or corrupt memory if
__GFP_NOFAIL returns NULL).

--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 12:41 am

... unless it's used with GFP_ATOMIC, which we've always returned NULL 
for when even ALLOC_HARDER can't find pages, right?

I'm wondering where this strong argument in favor of continuing to support 
__GFP_NOFAIL was when I insisted we call the oom killer for them even for 
allocations over PAGE_ALLOC_COSTLY_ORDER when __alloc_pages_nodemask() was 
refactored back in 2.6.31.  The argument was that nobody is allocating 
that high of orders of __GFP_NOFAIL pages so we don't need to free memory 
for them and that's where the deprecation of the modifier happened in the 
first place.  Ultimately, we did invoke the oom killer for those 
allocations because there's no chance of forward progress otherwise and, 
unlike __GFP_DMA, GFP_KERNEL | __GFP_NOFAIL actually is popular.  

I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask & 
__GFP_NOFAIL) path since we're all content with endlessly looping.
--

From: Nick Piggin
Date: Tuesday, February 16, 2010 - 12:53 am

I don't know. IMO we should never just randomly weaken or break such

Thanks. Yes endlessly looping is far preferable to randomly oopsing
or corrupting memory.

--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 1:25 am

Here's the new patch for your consideration.


oom: avoid oom killer for lowmem allocations

If memory has been depleted in lowmem zones even with the protection
afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
killing current users will help.  The memory is either reclaimable (or
migratable) already, in which case we should not invoke the oom killer at
all, or it is pinned by an application for I/O.  Killing such an
application may leave the hardware in an unspecified state and there is
no guarantee that it will be able to make a timely exit.

Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
not used so that the task can perhaps recover or try again later.

Previously, the heuristic provided some protection for those tasks with 
CAP_SYS_RAWIO, but this is no longer necessary since we will not be
killing tasks for the purposes of ISA allocations.

high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
default for all allocations that are not __GFP_DMA, __GFP_DMA32,
__GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
return true for allocations that have either __GFP_DMA or __GFP_DMA32.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1705,6 +1705,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		 */
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
+		/* The oom killer won't necessarily free lowmem */
+		if (high_zoneidx < ZONE_NORMAL)
+			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
 	out_of_memory(zonelist, gfp_mask, order, nodemask);
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, February 16, 2010 - 4:48 pm

On Tue, 16 Feb 2010 00:25:22 -0800 (PST)

Then, can we take kdump in this endlessly looping situaton ?

panic_on_oom=always + kdump can do that. 

Thanks,
-Kame

--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 5:03 pm

The endless loop is only helpful if something is going to free memory 
external to the current page allocation: either another task with 
__GFP_WAIT | __GFP_FS that invokes the oom killer, a task that frees 
memory, or a task that exits.

The most notable endless loop in the page allocator is the one when a task 
has been oom killed, gets access to memory reserves, and then cannot find 
a page for a __GFP_NOFAIL allocation:

	do {
		page = get_page_from_freelist(gfp_mask, nodemask, order,
			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
			preferred_zone, migratetype);

		if (!page && gfp_mask & __GFP_NOFAIL)
			congestion_wait(BLK_RW_ASYNC, HZ/50);
	} while (!page && (gfp_mask & __GFP_NOFAIL));

We don't expect any such allocations to happen during the exit path, but 
we could probably find some in the fs layer.

I don't want to check sysctl_panic_on_oom in the page allocator because it 
would start panicking the machine unnecessarily for the integrity 
metadata GFP_NOIO | __GFP_NOFAIL allocation, for any 
order > PAGE_ALLOC_COSTLY_ORDER, or for users who can't lock the zonelist 
for oom kill that wouldn't have panicked before.
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, February 16, 2010 - 5:03 pm

On Tue, 16 Feb 2010 16:03:23 -0800 (PST)

Then, why don't you check higzone_idx in oom_kill.c

Thanks,
-Kame

--

From: David Rientjes
Date: Tuesday, February 16, 2010 - 5:21 pm

out_of_memory() doesn't return a value to specify whether the page 
allocator should retry the allocation or just return NULL, all that policy 
is kept in mm/page_alloc.c.  For highzone_idx < ZONE_NORMAL, we want to 
fail the allocation when !(gfp_mask & __GFP_NOFAIL) and call the oom 
killer when it's __GFP_NOFAIL.
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1696,6 +1696,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		/* The OOM killer will not help higher order allocs */
 		if (order > PAGE_ALLOC_COSTLY_ORDER)
 			goto out;
+		/* The OOM killer does not needlessly kill tasks for lowmem */
+		if (high_zoneidx < ZONE_NORMAL)
+			goto out;
 		/*
 		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
@@ -1924,15 +1927,23 @@ rebalance:
 			if (page)
 				goto got_pg;
 
-			/*
-			 * The OOM killer does not trigger for high-order
-			 * ~__GFP_NOFAIL allocations so if no progress is being
-			 * made, there are no other options and retrying is
-			 * unlikely to help.
-			 */
-			if (order > PAGE_ALLOC_COSTLY_ORDER &&
-						!(gfp_mask & __GFP_NOFAIL))
-				goto nopage;
+			if (!(gfp_mask & __GFP_NOFAIL)) {
+				/*
+				 * The oom killer is not called for high-order
+				 * allocations that may fail, so if no progress
+				 * is being made, there are no other options and
+				 * retrying is unlikely to help.
+				 */
+				if (order > PAGE_ALLOC_COSTLY_ORDER)
+					goto nopage;
+				/*
+				 * The oom killer is not called for lowmem
+				 * allocations to prevent needlessly killing
+				 * innocent tasks.
+				 */
+				if (high_zoneidx < ZONE_NORMAL)
+					goto nopage;
+			}
 
 			goto restart;
 		}
--

From: Balbir Singh
Date: Tuesday, February 23, 2010 - 4:24 am

I am not sure if this is a good idea, ZONE_DMA could have a lot of
memory on some architectures. IIUC, we return NULL for allocations

-- 
	Three Cheers,
	Balbir
--

From: David Rientjes
Date: Tuesday, February 23, 2010 - 2:12 pm

As the patch description says, we would otherwise needlessly kill tasks 
that may not be consuming any lowmem since there is no way to determine 
its usage and typically the memory in lowmem will either be reclaimable 
(or migratable via memory compaction) if it is not pinned for I/O in which 
case we shouldn't kill for it anyway at this point.
--

From: David Rientjes
Date: Monday, February 15, 2010 - 3:20 pm

/proc/pid/oom_adj is now deprecated so that that it may eventually be
removed.  The target date for removal is December 2011.

A warning will be printed to the kernel log if a task attempts to use
this interface.  Future warning will be suppressed until the kernel is
rebooted to prevent spamming the kernel log.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/feature-removal-schedule.txt |   30 ++++++++++++++++++++++++++++
 Documentation/filesystems/proc.txt         |    3 ++
 fs/proc/base.c                             |    8 +++++++
 include/linux/oom.h                        |    3 ++
 4 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -168,6 +168,36 @@ Who:	Eric Biederman <ebiederm@xmission.com>
 
 ---------------------------
 
+What:	/proc/<pid>/oom_adj
+When:	December 2011
+Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
+	badness heuristic used to determine which task to kill when the kernel
+	is out of memory.
+
+	The badness heuristic has since been rewritten since the introduction of
+	this tunable such that its meaning is deprecated.  The value was
+	implemented as a bitshift on a score generated by the badness()
+	function that did not have any precise units of measure.  With the
+	rewrite, the score is given as a proportion of available memory to the
+	task allocating pages, so using a bitshift which grows the score
+	exponentially is, thus, impossible to tune with fine granularity.
+
+	A much more powerful interface, /proc/<pid>/oom_score_adj, was
+	introduced with the oom killer rewrite that allows users to increase or
+	decrease the badness() score linearly.  This interface will replace
+	/proc/<pid>/oom_adj.
+
+	See Documentation/filesystems/proc.txt for information on how to use the
+	new tunable.
+
+	A ...
From: Alan Cox
Date: Monday, February 15, 2010 - 3:28 pm

On Mon, 15 Feb 2010 14:20:16 -0800 (PST)

There are systems that rely on this feature. It's ABI, its sacred. We are
committed to it and it has users. That doesn't really detract from the
good/bad of the rest of the proposal, it's just one step we can't quite
make.

Alan
--

From: David Rientjes
Date: Monday, February 15, 2010 - 3:35 pm

Andrew suggested that it be deprecated in this way, so that's what was 
done.  I don't have any strong opinions about leaving it around forever 
now that it's otherwise unused beyond simply converting itself into units 
for /proc/pid/oom_score_adj at a much higher granularity.
--

From: David Rientjes
Date: Monday, February 15, 2010 - 3:20 pm

The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes.  There is no guarantee that
current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.

In such situations, it is better to scan the tasklist for nodes that are
allowed to allocate on current's set of nodes and kill the task with the
highest badness() score.  This ensures that the most memory-hogging task,
or the one configured by the user with /proc/pid/oom_adj, is always
selected in such scenarios.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/mempolicy.h |   13 +++++++-
 mm/mempolicy.c            |   39 +++++++++++++++++++++++
 mm/oom_kill.c             |   77 +++++++++++++++++++++++++++-----------------
 3 files changed, 98 insertions(+), 31 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -202,6 +202,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
 extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+				const nodemask_t *mask);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 	return node_zonelist(0, gfp_flags);
 }
 
-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
+{
+	return false;
+}
+
+static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+			const nodemask_t *mask)
+{
+	return false;
+}
 
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- ...
From: Balbir Singh
Date: Monday, February 22, 2010 - 11:31 pm

Seems reasonable, but I think it will require lots of testing.
-- 
	Three Cheers,
	Balbir
--

From: David Rientjes
Date: Tuesday, February 23, 2010 - 1:17 am

I already tested it by checking that tasks with very elevated oom_adj 
values don't get killed when they do not share the same MPOL_BIND nodes as 
a memory-hogging task.

What additional testing did you have in mind?
--

From: David Rientjes
Date: Monday, February 15, 2010 - 3:20 pm

Tasks that do not share the same set of allowed nodes with the task that
triggered the oom should not be considered as candidates for oom kill.

Tasks in other cpusets with a disjoint set of mems would be unfairly
penalized otherwise because of oom conditions elsewhere; an extreme
example could unfairly kill all other applications on the system if a
single task in a user's cpuset sets itself to OOM_DISABLE and then uses
more memory than allowed.

Killing tasks outside of current's cpuset rarely would free memory for
current anyway.

Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   12 +++---------
 1 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,7 +35,7 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 
 /*
- * Is all threads of the target process nodes overlap ours?
+ * Do all threads of the target process overlap our allowed nodes?
  */
 static int has_intersects_mems_allowed(struct task_struct *tsk)
 {
@@ -167,14 +167,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		points /= 4;
 
 	/*
-	 * If p's nodes don't overlap ours, it may still help to kill p
-	 * because p may have allocated or otherwise mapped memory on
-	 * this node before. However it will be less likely.
-	 */
-	if (!has_intersects_mems_allowed(p))
-		points /= 8;
-
-	/*
 	 * Adjust the score by oom_adj.
 	 */
 	if (oom_adj) {
@@ -266,6 +258,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
+		if (!has_intersects_mems_allowed(p))
+			continue;
 
 		/*
 		 * This task already has access to memory reserves and is
--

From: Nick Piggin
Date: Monday, February 15, 2010 - 11:14 pm

Previous thread: [PATCH] mtd: misplaced parenthesis in find_boot_record() by Roel Kluin on Monday, February 15, 2010 - 2:57 pm. (4 messages)

Next thread: [PATCH] dvb-usb/opera1: misplaced parenthesis by Roel Kluin on Monday, February 15, 2010 - 3:30 pm. (1 message)