This patchset is a rewrite of the out of memory killer to address several issues that have been raised recently. The most notable change is a complete rewrite of the badness heuristic that determines which task is killed; the goal was to make it as simple and predictable as possible while still addressing issues that plague the VM. Changes from version 1: - updated to mmotm-2010-02-11-21-55 - when iterating the tasklist for mempolicy-constrained oom conditions, the node of the cpu that a MPOL_F_LOCAL task is running on is now intersected with the page allocator's nodemask to determine whether it should be a candidate for oom kill. - added: [patch 4/9] oom: remove compulsory panic_on_oom mode - /proc/pid/oom_score_adj was added to prevent ABI breakage for applications using /proc/pid/oom_adj. /proc/pid/oom_adj may still be used with the old range but it is then scaled to oom_score_adj units for a rough linear approximation. There is no loss in functionality from the old interface. - added: [patch 6/9] oom: deprecate oom_adj tunable This patchset is based on mmotm-2010-02-11-21-55 because of the following dependencies: [patch 5/9] oom: badness heuristic rewrite: mm-count-swap-usage.patch [patch 7/9] oom: replace sysctls with quick mode: sysctl-clean-up-vm-related-variable-delcarations.patch To apply to mainline, download 2.6.33-rc8 and apply mm-clean-up-mm_counter.patch mm-avoid-false-sharing-of-mm_counter.patch mm-avoid-false_sharing-of-mm_counter-checkpatch-fixes.patch mm-count-swap-usage.patch mm-count-swap-usage-checkpatch-fixes.patch mm-introduce-dump_page-and-print-symbolic-flag-names.patch sysctl-clean-up-vm-related-variable-declarations.patch sysctl-clean-up-vm-related-variable-declarations-fix.patch from http://userweb.kernel.org/~akpm/mmotm/broken-out.tar.gz first. --- Documentation/feature-removal-schedule.txt | 30 + Documentation/filesystems/proc.txt | 100 +++--- ...
When a task is chosen for oom kill, the oom killer first attempts to
sacrifice a child not sharing its parent's memory instead.
Unfortunately, this often kills in a seemingly random fashion based on
the ordering of the selected task's child list. Additionally, it is not
guaranteed at all to free a large amount of memory that we need to
prevent additional oom killing in the very near future.
Instead, we now only attempt to sacrifice the worst child not sharing its
parent's memory, if one exists. The worst child is indicated with the
highest badness() score. This serves two advantages: we kill a
memory-hogging task more often, and we allow the configurable
/proc/pid/oom_adj value to be considered as a factor in which child to
kill.
Reviewers may observe that the previous implementation would iterate
through the children and attempt to kill each until one was successful
and then the parent if none were found while the new code simply kills
the most memory-hogging task or the parent. Note that the only time
oom_kill_task() fails, however, is when a child does not have an mm or
has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both
cases, so the final oom_kill_task() will always succeed.
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/oom_kill.c | 23 +++++++++++++++++------
1 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -432,7 +432,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
unsigned long points, struct mem_cgroup *mem,
const char *message)
{
+ struct task_struct *victim = p;
struct task_struct *c;
+ unsigned long victim_points = 0;
+ struct timespec uptime;
if ...Acked-by: Nick Piggin <npiggin@suse.de> --
If /proc/sys/vm/panic_on_oom is set to 2, the kernel will panic regardless of whether the memory allocation is constrained by either a mempolicy or cpuset. Since mempolicy-constrained out of memory conditions now iterate through the tasklist and select a task to kill, it is possible to panic the machine if all tasks sharing the same mempolicy nodes (including those with default policy, they may allocate anywhere) or cpuset mems have /proc/pid/oom_adj values of OOM_DISABLE. This is functionally equivalent to the compulsory panic_on_oom setting of 2, so the mode is removed. Signed-off-by: David Rientjes <rientjes@google.com> --- Documentation/sysctl/vm.txt | 20 ++++---------------- mm/oom_kill.c | 5 ----- 2 files changed, 4 insertions(+), 21 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -559,25 +559,13 @@ swap-intensive. panic_on_oom -This enables or disables panic on out-of-memory feature. +If this is set to zero, the oom killer will be invoked when the kernel is out of +memory and direct reclaim cannot free any pages. It will select a memory- +hogging task that frees up a large amount of memory to kill. -If this is set to 0, the kernel will kill some rogue process, -called oom_killer. Usually, oom_killer can kill rogue processes and -system will survive. - -If this is set to 1, the kernel panics when out-of-memory happens. -However, if a process limits using nodes by mempolicy/cpusets, -and those nodes become memory exhaustion status, one process -may be killed by oom-killer. No panic occurs in this case. -Because other nodes' memory may be free. This means system total status -may be not fatal yet. - -If this is set to 2, the kernel panics compulsorily even on the -above-mentioned. +If this is set to non-zero, the machine will panic when out of memory. The default value is 0. -1 and 2 are for failover of clustering. ...
On Mon, 15 Feb 2010 14:20:09 -0800 (PST) NACK. In an enviroment which depends on cluster-fail-over, this is useful even if in such situation. Thanks, -Kame --
You don't understand that the behavior has changed ever since mempolicy-constrained oom conditions are now affected by a compulsory panic_on_oom mode, please see the patch description. It's absolutely insane for a single sysctl mode to panic the machine anytime a cpuset or mempolicy runs out of memory and is more prone to user error from setting it without fully understanding the ramifications than any use it will ever do. The kernel already provides a mechanism for doing this, OOM_DISABLE. if you want your cpuset or mempolicy to risk panicking the machine, set all tasks that share its mems or nodes, respectively, to OOM_DISABLE. This is no different from the memory controller being immune to such panic_on_oom conditions, stop believing that it is the only mechanism used in the kernel to do memory isolation. --
On Mon, 15 Feb 2010 16:14:22 -0800 (PST) You don't explain why "we _have to_ remove API which is used" Thanks, -Kame --
First, I'm not stating that we _have_ to remove anything, this is a patch proposal that is open for review. Second, I believe we _should_ remove panic_on_oom == 2 because it's no longer being used as it was documented: as we've increased the exposure of the oom killer (memory controller, pagefault ooms, now mempolicy tasklist scanning), we constantly have to re-evaluate the semantics of this option while a well-understood tunable with a long history, OOM_DISABLE, already does the equivalent. The downside of getting this wrong is that the machine panics when it shouldn't have because of an unintended consequence of the mode being enabled (a mempolicy ooms, for example, that was created by the user). When reconsidering its semantics, I'd personally opt on the safe side and make sure the machine doesn't panic unnecessarily and instead require users to use OOM_DISABLE for tasks they do not want to be oom killed. --
On Tue, 16 Feb 2010 01:02:28 -0800 (PST) Please don't. I had a chance to talk with customer support team and talked about panic_on_oom briefly. I understood that panic_on_oom_alyways+kdump is the strongest tool for investigating customer's OOM situtation and do the best advice to them. panic_on_oom_always+kdump is the 100% information as snapshot when oom-killer happens. Then, it's easy to investigate and explain what is wront. They sometimes discover memory leak (by some prorietary driver) or miss-configuration of the system (as using unnecessary bounce buffer.) Then, please leave panic_on_oom=always. Even with mempolicy or cpuset 's OOM, we need panic_on_oom=always option. And yes, I'll add something similar to memcg. freeze_at_oom or something. Thanks, -Kame --
Ok, I'm not looking to cause your customers unnecessary grief by removing an option that they use, even though the same effect is possible by setting all tasks to OOM_DISABLE. I'll remove this patch in the next Memcg isn't a special case here, it should also panic the machine if panic_on_oom == 2, so if we aren't going to remove this option then I agree with Nick that we need to panic from mem_cgroup_out_of_memory() as well. Some users use cpusets, for example, for the same effect of memory isolation as you use memcg, so panicking in one scenario and not the other is inconsistent. --
On Tue, 16 Feb 2010 15:54:50 -0800 (PST)
Hmm, I have a few reason to add special behavior to memcg rather than panic.
- freeze_at_oom is enough.
If OOM can be notified, the management daemon can do useful jobs. Shutdown
all other cgroups or migrate them to other host and do kdump.
- memcg's oom is not very complicated.
Because we just counts RSS+FileCache
But, Hmm...I'd like to go this way.
1. At first, support panic_on_oom=2 in memcg.
2. Second, I'll add OOM-notifier and freeze_at_oom to memcg.
and don't call memcg_out_of_memory in oom_kill.c in this case. Because
we don't kill anything. Taking coredumps of all procs in memcg is not
very difficult.
I need to discuss with memcg guys. But this will be a way to go, I think
Thanks,
-Kame
--
This should panic in mem_cgroup_out_of_memory() and the documentation should be added to Documentation/sysctl/vm.txt. The memory controller also has some protection in the pagefault oom handler that seems like it could be made more general: instead of checking for mem_cgroup_oom_called(), I'd rather do a tasklist scan to check for already oom killed task (checking for the TIF_MEMDIE bit) and check all zones for ZONE_OOM_LOCKED. If no oom killed tasks are found and no zones are locked, we can check sysctl_panic_on_oom and invoke the system-wide The oom notifier would be at a higher level than the oom killer, the oom killer's job is simply to kill a task when it is called. So for these particular cases, you would never even call into out_of_memory() to panic the machine in the first place. Hopefully, the oom notifier can be made to be more generic as its own cgroup rather than only being used by memcg, but if such a userspace notifier would defer to the kernel oom killer, it should panic when panic_on_oom == 2 is selected regardless of whether it is constrained or not. Thus, we can keep the sysctl_panic_on_oom logic in the oom killer (both in out_of_memory() and mem_cgroup_out_of_memory()) without risk of unnecessarily panic whenever an oom notifier or freeze_at_oom setting intercepts the condition. --
On Tue, 16 Feb 2010 16:31:39 -0800 (PST) cpuset's difficulty is that there are some methods which share the limitation. It's not simple that we have - cpuset - mempolicy per task - mempolicy per vma plz remove memcg's hook after doing that. Current implemantation is desgined That's my point. Thanks, -Kame --
Ok, I'll eliminate pagefault_out_of_memory() and get it to use out_of_memory() by only checking for constrained_alloc() when Great, are you planning on implementing a cgroup that is based on roughly on the /dev/mem_notify patchset so userspace can poll() a file and be notified of oom events? It would help beyond just memcg, it has an application to cpusets (adding more mems on large systems) as well. It can also be used purely to preempt the kernel oom killer and move all the policy to userspace even though it would be sacrificing TIF_MEMDIE. --
On Tue, 16 Feb 2010 16:54:31 -0800 (PST) I start from memcg because that gives us simple and clean, no heulistics operation and we will not have ugly corner cases. And we can _expect_ that memcg has management daemon of OOM in other cgroup. Because memcg's memory shortage never means "memory is exhausted", we can expect that daemon can work well. Now, memcg has memory-usage-notifier file. oom-notifier will not be far differnet from that. cpuset should have its own if necessary. cpuset's difficulty is that the memory on its nodes are _really_ exhausted and we're not sure it can affecet management daemon at el...hang up. BTW, concept of /dev/mem_notify is notify before OOM, not notify when OOM. Now, memcg has memory-usage-notifier and that's implemented in some meaning. Thanks, -Kame --
What do you think about making pagefaults use out_of_memory() directly and
respecting the sysctl_panic_on_oom settings?
This removes the check for a parallel memcg oom killing since we can
guarantee that's not going to happen if we take ZONE_OOM_LOCKED for all
populated zones (nobody is currently executing the oom killer) and no
tasks have TIF_MEMDIE set.
Signed-off-by: David Rientjes <rientjes@google.com>
---
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -124,7 +124,6 @@ static inline bool mem_cgroup_disabled(void)
return false;
}
-extern bool mem_cgroup_oom_called(struct task_struct *task);
void mem_cgroup_update_file_mapped(struct page *page, int val);
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask, int nid,
@@ -258,11 +257,6 @@ static inline bool mem_cgroup_disabled(void)
return true;
}
-static inline bool mem_cgroup_oom_called(struct task_struct *task)
-{
- return false;
-}
-
static inline int
mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -200,7 +200,6 @@ struct mem_cgroup {
* Should the accounting and control be hierarchical, per subtree?
*/
bool use_hierarchy;
- unsigned long last_oom_jiffies;
atomic_t refcnt;
unsigned int swappiness;
@@ -1234,34 +1233,6 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
return total;
}
-bool mem_cgroup_oom_called(struct task_struct *task)
-{
- bool ret = false;
- struct mem_cgroup *mem;
- struct mm_struct *mm;
-
- rcu_read_lock();
- mm = task->mm;
- if (!mm)
- mm = &init_mm;
- mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
- if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
- ret = true;
- rcu_read_unlock();
- return ret;
-}
-
-static int record_last_oom_cb(struct ...On Tue, 16 Feb 2010 17:58:05 -0800 (PST) I don't think this patch is good. Because several memcg can cause oom at the same time independently, system-wide oom locking is unsuitable. BTW, what I doubt is much more fundamental thing. What I doubt at most is "why VM_FAULT_OOM is necessary ? or why we have to call oom_killer when page fault returns it". Is there someone who returns VM_FAULT_OOM without calling page allocator and oom-killer helps something in such situation ? If returning VM_FAULT_OOM without caliing usual page allocator, oom-killer will be never help, I guess. If we don't have that, we don't have to implement pagefault_out_of_memory. Hmm ? Thanks, --
On Wed, 17 Feb 2010 11:13:19 +0900 And basically. memcg's oom means "the usage over the limits!!" and does never means "resouce is exhausted!!". Then, marking OOM to zones sounds strange. You can cause oom in 64MB memcg in 64GB system. Thanks, -Kame --
ZONE_OOM_LOCKED is taken system-wide because the result of a memcg oom is that a task will get killed and free memory, so VM_FAULT_OOM doesn't require any additional killing if we're oom, it should just retry after the task has exited. If we remove the zone locking for memcg, it is possible that pagefaults will race with setting TIF_MEMDIE and two tasks get killed instead. I guess that's acceptable considering its just as likely that the memcg will reallocate to the same limit again and cause VM_FAULT_OOM to rekill. --
We want to lock all populated zones with ZONE_OOM_LOCKED to avoid needlessly killing more than one task regardless of how many memcgs are Before we invoked the oom killer for VM_FAULT_OOM, we simply sent a SIGKILL to current because we simply don't have memory to fault the page in, it's better to select a memory-hogging task to kill based on badness() than to constantly kill current which may not help in the long term. --
On Tue, 16 Feb 2010 18:28:05 -0800 (PST) What I mean is - What VM_FAULT_OOM means is not "memory is exhausted" but "something is exhausted". For example, when hugepages are all used, it may return VM_FAULT_OOM. Especially when nr_overcommit_hugepage == usage_of_hugepage, it returns VM_FAULT_OOM. Then, what oom-killer can help it ? I think never and the requester should die. Before modifying current code, I think we have to check all VM_FAULT_OOM and distinguish - memory is exhausted (and page allocater wasn't called.) - something other than memory is exhausted. And, in hugepage case, even order > PAGE_ALLOC_COSTLY_ORDER, oom-killer is called and pagegault_oom_kill kills tasks randomly. Thanks, -Kame --
I've updated my patch to not take ZONE_OOM_LOCKED for any zones on memcg oom. I'm hoping that you will add sysctl_panic_on_oom == 2 for this case The hugetlb case seems to be the only misuse of VM_FAULT_OOM where it doesn't mean we simply don't have the memory to handle the page fault, i.e. your earlier "memory is exhausted" definition. That was handled well before calling out_of_memory() by simply killing current since we know it is faulting hugetlb pages and its resource is limited. We could pass the vma to pagefault_out_of_memory() and simply kill current if its killable and is_vm_hugetlb_page(vma). --
On Tue, 16 Feb 2010 18:58:17 -0800 (PST) No. hugepage is not only case. You may not read but we annoyed i915's driver bug recently and it was clearly misuse of VM_FAULT_OOM. Then, we got many reports of OOM killer in these months. (thanks to Kosaki about this.) quick glance around core codes... - HUGEPAGE at el. should return some VM_FAULT_NO_RESOUECE rather than VM_FAULT_OOM. - filemap.c's VM_FAULT_OOM shoudn't call page_fault_oom_kill because it has already called oom_killer if it can. - about relayfs, is VM_FAULT_OOM should be BUG_ON()... - filemap_xip.c return VM_FAULT_OOM....but it doesn't seem to be OOM.. just like VM_FAULT_NO_VALID_PAGE_FOUND. (But I'm not familiar with this area.) - fs/buffer.c 's VM_FAULT_OOM is returned oom-killer is called. - shmem.c's VM_FAULT_OOM is retuned oom-killer is called. i915's VM_FAULT_OOM is miterious but I can't find whether its real OOM or just shortage of is own resource. I think VM_FAULT_NO_RESOUCE should be added. Thanks, -Kame --
We can detect this with is_vm_hugetlb_page() if we pass the vma into The filemap, shmem, and block_prepare_write() cases will call the oom killer but, depending on the gfp mask, they will retry their allocations after the oom killer is called so we should never return VM_FAULT_OOM because they return -ENOMEM. They fail from either small objsize slab allocations or with orders less than PAGE_ALLOC_COSTLY_ORDER which by default continues to retry even if direct reclaim fails. If we're returning with VM_FAULT_OOM from these handlers, it should only be because of GFP_NOFS | __GFP_NORETRY or current has been oom killed and still can't find memory (so we don't care if the oom killer is called again since it won't kill anything else). So like I said, I don't really see a need where VM_FAULT_NO_RESOURCE would be helpful in any case other than hugetlb which we can already detect by passing the vma into the pagefault oom handler. --
The real question is, what to do when returning to userspace. I don't think there's a lot of options. SIGBUS is traditionally used for "no SIGBUS, probably, yes. I questioned this as well but it mustn't Yep. And yes you are right that we prefer to do the oom killing at the allocation point where we know all the context, however the fact is that VM_FAULT_OOM is an allowed part of the fault API so we have to handle it somehow. It can theoretically be called for valid reasons say if a driver or arch page table has a high order allocation, or if the page allocator implementation were to be changed. We can't rightly just kill the task at this point, even if it has invoked the oom killer, because it could have been marked as unkillable. --
For is_vm_hugetlb_page() in the pagefault oom handler, I think it should
default to killing current as we did previously until that's worked out
(and as some architectures like ia64 and powerpc still do). In fact,
pagefault ooms should probably always default to killing current if its
That's easy to test in the oom handler, we can default to killing current
but then kill another task if it is unkillable:
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -696,15 +696,23 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
}
/*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If a populated zone has ZONE_OOM_LOCKED set, a parallel
- * oom killing is already in progress so do nothing. If a task is found with
- * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
+ * The pagefault handler calls here because it is out of memory, so kill current
+ * by default. If it's unkillable, then fallback to killing a memory-hogging
+ * task. If a populated zone has ZONE_OOM_LOCKED set, a parallel oom killing is
+ * already in progress so do nothing. If a task is found with TIF_MEMDIE set,
+ * it has been killed so do nothing and allow it to exit.
*/
void pagefault_out_of_memory(void)
{
+ unsigned long totalpages;
+ int err;
+
if (!try_set_system_oom())
return;
- out_of_memory(NULL, 0, 0, NULL);
+ constrained_alloc(NULL, 0, NULL, &totalpages);
+ err = oom_kill_process(current, 0, 0, 0, totalpages, NULL,
+ "Out of memory (pagefault)"))
+ if (err)
+ out_of_memory(NULL, 0, 0, NULL);
clear_system_oom();
}
We'll need to convert the architectures that still only issue a SIGKILL to
current to use pagefault_out_of_memory() before OOM_DISABLE is fully
respected across the kernel, though.
--
Hi. It might be a bit off-topic, but memcg's check for last_oom_jiffies seems not to work well under heavy load, and pagefault_out_of_memory() causes global oom. Step.1 make a memory cgroup directory and sed memory.limit_in_bytes to a small value > mkdir /cgroup/memory/test > echo 1M >/cgroup/memory/test/memory.limit_in_bytes Stem.2 run attached test program(which allocates memory and does fork recursively) > ./recursive_fork -c 8 -s `expr 1 \* 1024 \* 1024` This causes not only memcg's oom, but also global oom(My machine has 8 CPUS). === [348090.121808] recursive_fork3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0 [348090.121821] recursive_fork3 cpuset=/ mems_allowed=0 [348090.121829] Pid: 22744, comm: recursive_fork3 Not tainted 2.6.32.8-00001-gb6cd517 #3 [348090.121832] Call Trace: [348090.121849] [<ffffffff810d6015>] oom_kill_process+0x86/0x295 [348090.121855] [<ffffffff810d64cf>] ? select_bad_process+0x63/0xf0 [348090.121861] [<ffffffff810d687a>] mem_cgroup_out_of_memory+0x69/0x87 [348090.121870] [<ffffffff811119c2>] __mem_cgroup_try_charge+0x15f/0x1d4 [348090.121876] [<ffffffff811126bc>] mem_cgroup_try_charge_swapin+0x104/0x159 [348090.121885] [<ffffffff810edd9b>] handle_mm_fault+0x4ca/0x76c [348090.121895] [<ffffffff8143419f>] ? do_page_fault+0x141/0x2da [348090.121904] [<ffffffff81087286>] ? trace_hardirqs_on+0xd/0xf [348090.121910] [<ffffffff8143419f>] ? do_page_fault+0x141/0x2da [348090.121915] [<ffffffff8143431c>] do_page_fault+0x2be/0x2da [348090.121922] [<ffffffff81432115>] page_fault+0x25/0x30 [348090.121929] Task in /test killed as a result of limit of /test [348090.121936] memory: usage 1024kB, limit 1024kB, failcnt 279335 [348090.121940] memory+swap: usage 4260kB, limit 9007199254740991kB, failcnt 0 [348090.121943] Mem-Info: [348090.121947] Node 0 DMA per-cpu: [348090.121952] CPU 0: hi: 0, btch: 1 usd: 0 [348090.121956] CPU 1: hi: 0, btch: 1 usd: 0 [348090.121960] CPU 2: hi: 0, btch: 1 usd: ...
On Mon, 22 Feb 2010 14:31:51 +0900 Yes, current design is not the best thing, my bad. (I had to band-aid against unexpected panic in pagefault_out_of_memory.) But tweaking that vaule seems not promissing. Essential fix is better. The best fix is don't call oom-killer in pagefault_out_of_memory. So, returning other than VM_FAULT_OOM is the best, I think. But hmm...we don't have VM_FAULT_AGAIN etc.. So, please avoid quick fix. One thing I can think of is sleep-and-retry in try_charge() if PF_MEMDIE is not set. (But..By this, memcg will never return faiulre in page fault. but it may sound reasonable.) Thanks, --
On Mon, 22 Feb 2010 15:15:13 +0900 hmm, I can agree with you. But I think we need some trick to distinguish normal VM_FAULT_OOM and memcg's VM_FAULT_OOM(the current itself was killed by memcg's oom, so exited the retry) at mem_cgroup_oom_called() to avoid the system from panic when panic_on_oom is enabled. (Mark the task which is being killed by memcg's oom ?). Thanks, Daisuke Nishimura. --
pagefault_out_of_memory() should use mem_cgroup_from_task(current) and then call mem_cgroup_out_of_memory() when it's non-NULL. select_bad_process() will return ERR_PTR(-1UL) if there is an already oom killed task attached to the memcg, so we can use that to avoid the panic_on_oom. The setting of that sysctl doesn't imply that we can't scan the tasklist, it simply means we can't kill anything as a result of an oom. --
On Mon, 22 Feb 2010 20:42:37 +0900 I'll prepare a patch today in other thread. Thanks, -Kame --
The last patch in my oom killer series defaults pagefault_out_of_memory() to always kill current first, if it's killable. If it is unsuccessful, we fallback to scanning the entire tasklist. For tasks that are constrained by a memcg, we could probably use mem_cgroup_from_task(current) and if it's non-NULL and non-root, call mem_cgroup_out_of_memory() with a gfp_mask of 0. That would at least penalize the same memcg instead of invoking a global oom and would try the additional logic that you plan on adding to avoid killing any task at all in such conditions. --
At least, I agree pagefault oom part. it need ZONE_OOM_LOCKED too. if you make separated patch, I'll ack it. I don't know memcg part is correct or not. --
What is the point of removing it, though? If it doesn't significantly help some future patch, just leave it in. It's not worth breaking the user/kernel interface just to remove 3 trivial lines of code. --
Because it is inconsistent at the user's expense, it has never panicked the machine for memory controller ooms, so why is a cpuset or mempolicy constrained oom conditions any different? It also panics the machine even on VM_FAULT_OOM which is ridiculous, the tunable is certainly not being used how it was documented and so given the fact that mempolicy constrained ooms are now much smarter with my rewrite and we never simply kill current unless oom_kill_quick is enabled anymore, the compulsory panic_on_oom == 2 mode is no longer required. Simply set all tasks attached to a cpuset or bound to a specific mempolicy to be OOM_DISABLE, the kernel need not provide confusing alternative modes to sysctls for this behavior. Before panic_on_oom == 2 was introduced, it would have only panicked the machine if panic_on_oom was set to a non-zero integer, defining it be something different for '2' after it has held the same semantics for years is inappropriate. There is just no concrete example that anyone can give where they want a cpuset-constrained oom to panic the machine when other tasks on a disjoint set of mems can continue to do work and the cpuset of interest cannot have its tasks set to OOM_DISABLE. --
Well memory controller was added later, wasn't it? So if you think that's a bug then a fix to panic on memory controller ooms might Well it was always defined in the documentation that it should be 0 or 1. Just that the limit wasn't enforced. I agree that's not ideal, but anyway the existing and documented 0/1/2 has been there for 3 years But this is changing the way the environment is required to set up. So a kernel upgrade can break previously working setups. We don't do this without really good reason. --
But what about the existing memcg users who set panic_on_oom == 2 and Because the oom killer was never called for VM_FAULT_OOM before, we simply sent a SIGKILL to current, i.e. the original panic_on_oom semantics were It was meant to panic the machine anytime it was out of memory, regardless of the constraint, but that obviously doesn't match the memory controller case. Just because cpusets and mempolicies decide to use the oom killer as a mechanism for enforcing a user-defined policy does not mean that we want to panic for them: mempolicies, for example, are user created and do not require any special capability. Does it seem reasonable that an oom condition on those mempolicy nodes should panic the machine when killing the offender is possible (and perhaps even encouraged if the user sets a high /proc/pid/oom_score_adj?) In other words, is an admin setting panic_on_oom == 2 really expecting that no application will use set_mempolicy() or do an mbind()? This is a very error-prone interface that needs to be dealt with on a case-by-case basis and the perfect way to do that is by setting the affected tasks to be OOM_DISABLE; that interface, unlike panic_on_oom == 2, is very well understood by those with CAP_SYS_RESOURCE. --
But that was a bug in the addition of the memory controller. Either the No but now they are. I don't know what your point is here because there is no way the users of this interface can be expected to know about VM_FAULT_OOM versus pagefault_out_of_memory let alone do anything useful Right, and it's been like that for 3 years and people who don't use the memory controller will be using that tunable. I assume it is reasonable to want to panic on any OOM if you're after fail-stop kind of behaviour. I guess that is why it was added. I see more use for that case than panic_on_oom==1 case myself. --
On Tue, 16 Feb 2010 19:08:17 +1100 I'll add a documentation to memcg. As "When you exhaust memory resource under memcg, oom-killer may be invoked. But in this case, the system never panics even when panic_on_oom is set." Maybe I should add "memcg_oom_notify (netlink message or file-decriptor or some". Because memcg's oom is virtual oom, automatic management software can show report to users and can do fail-over. I'll consider something useful for memcg oom-fail-over instead of panic. In the simplest case, cgroup's notiifer file descriptor can be used. Thanks, -Kame --
The memory controller behavior seems intentional because it prevents panicking in two places: mem_cgroup_out_of_memory() never considers it and sysctl_panic_on_oom is preempted in pagefault_out_of_memory() if current's memcg is oom. The documentation is currently right because it only mentions an application to cpusets and mempolicies. That's the reason why I think we should eliminate it: it is completely bogus as it stands because it allows tasks to be killed in memory controller environments if their hard limit is reached unless they are set to OOM_DISABLE. That doesn't have fail-stop behavior and trying to make exceptions to the rule is not true "fail-stop" that we need to preserve I think VM_FAULT_OOM should panic the machine for panic_on_oom == 1 as it presently does, it needs no special handling otherwise. But this is an example of where semantics of panic_on_oom have changed in the past where OOM_DISABLE would remove any ambiguity. Instead of redefining the sysctl's semantics everytime we add another usecase for the oom killer, why can't we just use a single interface that has been around for years I doubt you'll find much support from the memory controller folks on that since they probably won't agree this is fail-stop behavior and killing a task when constrained by a memcg is appropriate because the user asked for a hard limit. Again, OOM_DISABLE would remove all ambiguity and we wouldn't need to concern ourselves of what the semantics of a poorly chosen interface such panic_on_oom == 1 is reasonable since no system task can make forward progress in allocating memory, that isn't necessarily true of cpuset or mempolicy (or memcg) constrained applications. Other cpusets, for instance, can continue to do work uninterrupted and without threat of having one of their tasks being oom killed. --
This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of "allowable" memory. "Allowable," in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of "allowable" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Forkbomb detection is done in a completely different way: a threshold is configurable from userspace to determine how many first-generation execve children (those with their own address spaces) a task may have before it is considered a forkbomb. This can be tuned by altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to 1000. When a task has more than 1000 first-generation children with different address spaces than itself, a penalty of (average rss of children) * (# of 1st generation execve children) ----------------------------------------------------------------- oom_forkbomb_thres is assessed. So, for example, using the default oom_forkbomb_thres of 1000, the penalty is twice the average rss of all its ...
Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were implemented for very large systems to avoid excessively long tasklist scans. The former suppresses helpful diagnostic messages that are emitted for each thread group leader that are candidates for oom kill including their pid, uid, vm size, rss, oom_adj value, and name; this information is very helpful to users in understanding why a particular task was chosen for kill over others. The latter simply kills current, the task triggering the oom condition, instead of iterating through the tasklist looking for the worst offender. Both of these sysctls are combined into one for use on the aforementioned large systems: oom_kill_quick. This disables the now-default oom_dump_tasks and kills current whenever the oom killer is called. The oom killer rewrite is the perfect opportunity to combine both sysctls into one instead of carrying around the others for years to come for nothing else than legacy purposes. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> --- Documentation/sysctl/vm.txt | 44 +++++------------------------------------- include/linux/oom.h | 3 +- kernel/sysctl.c | 13 ++--------- mm/oom_kill.c | 9 +++---- 4 files changed, 14 insertions(+), 55 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -43,9 +43,8 @@ Currently, these files are in /proc/sys/vm: - nr_pdflush_threads - nr_trim_pages (only if CONFIG_MMU=n) - numa_zonelist_order -- oom_dump_tasks - oom_forkbomb_thres -- oom_kill_allocating_task +- oom_kill_quick - overcommit_memory - overcommit_ratio - page-cluster @@ -470,27 +469,6 @@ this is causing problems for your system/application. ============================================================== -oom_dump_tasks - -Enables a system-wide task dump ...
I just don't understand this either. There appears to be simply no --
When oom_dump_tasks() is always emitted for out of memory conditions as my patch does, then these two tunables have the exact same audience: users with large systems that have extremely long tasklists. They want to avoid tasklist scanning (either to select a bad process to kill or dump their information) in oom conditions and simply kill the allocating task. I chose to combine the two: we're not concerned about breaking the oom_dump_tasks ABI since it's now the default behavior and since we scan the tasklist for mempolicy-constrained ooms, users may now choose to enable oom_kill_allocating_task when they previously wouldn't have. To do that, they can either use the old sysctl or convert to this new sysctl with the benefit that we've removed one unnecessary sysctl from /proc/sys/vm. As far as I know, oom_kill_allocating_task is only used by SGI, anyway, since they are the ones who asked for it when I implemented cpuset tasklist scanning. It's certainly not widely used and since the semantics for mempolicies have changed, oom_kill_quick may find more users. --
Remove the redundancy in __oom_kill_task() since:
- init can never be passed to this function: it will never be PF_EXITING
or selectable from select_bad_process(), and
- it will never be passed a task from oom_kill_task() without an ->mm
and we're unconcerned about detachment from exiting tasks, there's no
reason to protect them against SIGKILL or access to memory reserves.
Also moves the kernel log message to a higher level since the verbosity
is not always emitted here; we need not print an error message if an
exiting task is given a longer timeslice.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/oom_kill.c | 64 ++++++++++++++------------------------------------------
1 files changed, 16 insertions(+), 48 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -405,67 +405,35 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
dump_tasks(mem);
}
-#define K(x) ((x) << (PAGE_SHIFT-10))
-
/*
- * Send SIGKILL to the selected process irrespective of CAP_SYS_RAW_IO
- * flag though it's unlikely that we select a process with CAP_SYS_RAW_IO
- * set.
+ * Give the oom killed task high priority and access to memory reserves so that
+ * it may quickly exit and free its memory.
*/
-static void __oom_kill_task(struct task_struct *p, int verbose)
+static void __oom_kill_task(struct task_struct *p)
{
- if (is_global_init(p)) {
- WARN_ON(1);
- printk(KERN_WARNING "tried to kill init!\n");
- return;
- }
-
- task_lock(p);
- if (!p->mm) {
- WARN_ON(1);
- printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
- task_pid_nr(p), p->comm);
- task_unlock(p);
- return;
- }
-
- if (verbose)
- printk(KERN_ERR "Killed process %d (%s) "
- "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
- task_pid_nr(p), ...If memory has been depleted in lowmem zones even with the protection
afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
killing current users will help. The memory is either reclaimable (or
migratable) already, in which case we should not invoke the oom killer at
all, or it is pinned by an application for I/O. Killing such an
application may leave the hardware in an unspecified state and there is
no guarantee that it will be able to make a timely exit.
Lowmem allocations are now failed in oom conditions so that the task can
perhaps recover or try again later. Killing current is an unnecessary
result for simply making a GFP_DMA or GFP_DMA32 page allocation and no
lowmem allocations use the now-deprecated __GFP_NOFAIL bit so retrying is
unnecessary.
Previously, the heuristic provided some protection for those tasks with
CAP_SYS_RAWIO, but this is no longer necessary since we will not be
killing tasks for the purposes of ISA allocations.
high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
default for all allocations that are not __GFP_DMA, __GFP_DMA32,
__GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
flags. Testing for high_zoneidx being less than ZONE_NORMAL will only
return true for allocations that have either __GFP_DMA or __GFP_DMA32.
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/page_alloc.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1914,6 +1914,9 @@ rebalance:
* running out of options and have to consider going OOM
*/
if (!did_some_progress) {
+ /* The oom killer won't necessarily free lowmem */
+ if (high_zoneidx < ZONE_NORMAL)
+ goto nopage;
if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
if (oom_killer_disabled)
goto ...On Mon, 15 Feb 2010 14:20:21 -0800 (PST) WARN_ON((high_zoneidx < ZONE_NORMAL) && (gfp_mask & __GFP_NOFAIL)) plz. Thanks, -Kame --
As I already explained when you first brought this up, the possibility of not invoking the oom killer is not unique to GFP_DMA, it is also possible for GFP_NOFS. Since __GFP_NOFAIL is deprecated and there are no current users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary. We're not adding any additional __GFP_NOFAIL allocations. --
On Mon, 15 Feb 2010 16:10:15 -0800 (PST) Please add documentation about that to gfp.h before doing this. Doing this without writing any documenation is laziness. (WARNING is a style of documentation.) Thanks, -Kame --
This is already documented in the page allocator, but I guess doing it in include/linux/gfp.h as well doesn't hurt. mm: add comment about deprecation of __GFP_NOFAIL __GFP_NOFAIL was deprecated in dab48dab, so add a comment that no new users should be added. Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> --- include/linux/gfp.h | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -30,7 +30,8 @@ struct vm_area_struct; * _might_ fail. This depends upon the particular VM implementation. * * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller - * cannot handle allocation failures. + * cannot handle allocation failures. This modifier is deprecated and no new + * users should be added. * * __GFP_NORETRY: The VM implementation must not retry indefinitely. * --
On Mon, 15 Feb 2010 17:13:57 -0800 (PST) I want warning when someone uses OBSOLETE interface but... Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> I hope no 3rd vendor (proprietary) driver uses __GFP_NOFAIL, they tend to --
I hope they don't use it with GFP_ATOMIC, either, because it's never been respected in that context. We can easily audit the handful of cases in the kernel that use __GFP_NOFAIL (it takes five minutes at the max) and prove that none use it with GFP_ATOMIC or GFP_NOFS. We don't need to add multitudes of warnings about using a deprecated flag with ludicrous combinations (does anyone really expect GFP_ATOMIC | __GFP_NOFAIL to work gracefully)? --
You don't need to add warnings, just don't break existing working combinations and nobody has anything to complain about. --
No current user? I don't think so.
int bio_integrity_prep(struct bio *bio)
{
(snip)
buf = kmalloc(len, GFP_NOIO | __GFP_NOFAIL | q->bounce_gfp);
and
void blk_queue_bounce_limit(struct request_queue *q, u64 dma_mask)
{
(snip)
if (dma) {
init_emergency_isa_pool();
q->bounce_gfp = GFP_NOIO | GFP_DMA;
q->limits.bounce_pfn = b_pfn;
}
I don't like rumor based discussion, I like fact based one.
Thanks.
--
The GFP_NOIO will prevent the oom killer from being called, it requires __GFP_FS. I can change this to invoke the should_alloc_retry() logic by testing for !(gfp_mask & __GFP_NOFAIL), but there's nothing else the page allocator can currently do to increase its probability of allocating pages; the memory compaction patchset might be particularly helpful for these types of scenarios. --
Completely agree with this request. Actually, I think even better you should just add && !(gfp_mask & __GFP_NOFAIL). Deprecated doesn't mean it is OK to break the API (callers *will* oops or corrupt memory if __GFP_NOFAIL returns NULL). --
... unless it's used with GFP_ATOMIC, which we've always returned NULL for when even ALLOC_HARDER can't find pages, right? I'm wondering where this strong argument in favor of continuing to support __GFP_NOFAIL was when I insisted we call the oom killer for them even for allocations over PAGE_ALLOC_COSTLY_ORDER when __alloc_pages_nodemask() was refactored back in 2.6.31. The argument was that nobody is allocating that high of orders of __GFP_NOFAIL pages so we don't need to free memory for them and that's where the deprecation of the modifier happened in the first place. Ultimately, we did invoke the oom killer for those allocations because there's no chance of forward progress otherwise and, unlike __GFP_DMA, GFP_KERNEL | __GFP_NOFAIL actually is popular. I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask & __GFP_NOFAIL) path since we're all content with endlessly looping. --
I don't know. IMO we should never just randomly weaken or break such Thanks. Yes endlessly looping is far preferable to randomly oopsing or corrupting memory. --
Here's the new patch for your consideration. oom: avoid oom killer for lowmem allocations If memory has been depleted in lowmem zones even with the protection afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that killing current users will help. The memory is either reclaimable (or migratable) already, in which case we should not invoke the oom killer at all, or it is pinned by an application for I/O. Killing such an application may leave the hardware in an unspecified state and there is no guarantee that it will be able to make a timely exit. Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is not used so that the task can perhaps recover or try again later. Previously, the heuristic provided some protection for those tasks with CAP_SYS_RAWIO, but this is no longer necessary since we will not be killing tasks for the purposes of ISA allocations. high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the default for all allocations that are not __GFP_DMA, __GFP_DMA32, __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those flags. Testing for high_zoneidx being less than ZONE_NORMAL will only return true for allocations that have either __GFP_DMA or __GFP_DMA32. Signed-off-by: David Rientjes <rientjes@google.com> --- mm/page_alloc.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1705,6 +1705,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, */ if (gfp_mask & __GFP_THISNODE) goto out; + /* The oom killer won't necessarily free lowmem */ + if (high_zoneidx < ZONE_NORMAL) + goto out; } /* Exhausted what can be done so it's blamo time */ out_of_memory(zonelist, gfp_mask, order, nodemask); --
On Tue, 16 Feb 2010 00:25:22 -0800 (PST) Then, can we take kdump in this endlessly looping situaton ? panic_on_oom=always + kdump can do that. Thanks, -Kame --
The endless loop is only helpful if something is going to free memory
external to the current page allocation: either another task with
__GFP_WAIT | __GFP_FS that invokes the oom killer, a task that frees
memory, or a task that exits.
The most notable endless loop in the page allocator is the one when a task
has been oom killed, gets access to memory reserves, and then cannot find
a page for a __GFP_NOFAIL allocation:
do {
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
preferred_zone, migratetype);
if (!page && gfp_mask & __GFP_NOFAIL)
congestion_wait(BLK_RW_ASYNC, HZ/50);
} while (!page && (gfp_mask & __GFP_NOFAIL));
We don't expect any such allocations to happen during the exit path, but
we could probably find some in the fs layer.
I don't want to check sysctl_panic_on_oom in the page allocator because it
would start panicking the machine unnecessarily for the integrity
metadata GFP_NOIO | __GFP_NOFAIL allocation, for any
order > PAGE_ALLOC_COSTLY_ORDER, or for users who can't lock the zonelist
for oom kill that wouldn't have panicked before.
--
On Tue, 16 Feb 2010 16:03:23 -0800 (PST) Then, why don't you check higzone_idx in oom_kill.c Thanks, -Kame --
out_of_memory() doesn't return a value to specify whether the page
allocator should retry the allocation or just return NULL, all that policy
is kept in mm/page_alloc.c. For highzone_idx < ZONE_NORMAL, we want to
fail the allocation when !(gfp_mask & __GFP_NOFAIL) and call the oom
killer when it's __GFP_NOFAIL.
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1696,6 +1696,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
/* The OOM killer will not help higher order allocs */
if (order > PAGE_ALLOC_COSTLY_ORDER)
goto out;
+ /* The OOM killer does not needlessly kill tasks for lowmem */
+ if (high_zoneidx < ZONE_NORMAL)
+ goto out;
/*
* GFP_THISNODE contains __GFP_NORETRY and we never hit this.
* Sanity check for bare calls of __GFP_THISNODE, not real OOM.
@@ -1924,15 +1927,23 @@ rebalance:
if (page)
goto got_pg;
- /*
- * The OOM killer does not trigger for high-order
- * ~__GFP_NOFAIL allocations so if no progress is being
- * made, there are no other options and retrying is
- * unlikely to help.
- */
- if (order > PAGE_ALLOC_COSTLY_ORDER &&
- !(gfp_mask & __GFP_NOFAIL))
- goto nopage;
+ if (!(gfp_mask & __GFP_NOFAIL)) {
+ /*
+ * The oom killer is not called for high-order
+ * allocations that may fail, so if no progress
+ * is being made, there are no other options and
+ * retrying is unlikely to help.
+ */
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ goto nopage;
+ /*
+ * The oom killer is not called for lowmem
+ * allocations to prevent needlessly killing
+ * innocent tasks.
+ */
+ if (high_zoneidx < ZONE_NORMAL)
+ goto nopage;
+ }
goto restart;
}
--
I am not sure if this is a good idea, ZONE_DMA could have a lot of memory on some architectures. IIUC, we return NULL for allocations -- Three Cheers, Balbir --
As the patch description says, we would otherwise needlessly kill tasks that may not be consuming any lowmem since there is no way to determine its usage and typically the memory in lowmem will either be reclaimable (or migratable via memory compaction) if it is not pinned for I/O in which case we shouldn't kill for it anyway at this point. --
/proc/pid/oom_adj is now deprecated so that that it may eventually be removed. The target date for removal is December 2011. A warning will be printed to the kernel log if a task attempts to use this interface. Future warning will be suppressed until the kernel is rebooted to prevent spamming the kernel log. Signed-off-by: David Rientjes <rientjes@google.com> --- Documentation/feature-removal-schedule.txt | 30 ++++++++++++++++++++++++++++ Documentation/filesystems/proc.txt | 3 ++ fs/proc/base.c | 8 +++++++ include/linux/oom.h | 3 ++ 4 files changed, 44 insertions(+), 0 deletions(-) diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -168,6 +168,36 @@ Who: Eric Biederman <ebiederm@xmission.com> --------------------------- +What: /proc/<pid>/oom_adj +When: December 2011 +Why: /proc/<pid>/oom_adj allows userspace to influence the oom killer's + badness heuristic used to determine which task to kill when the kernel + is out of memory. + + The badness heuristic has since been rewritten since the introduction of + this tunable such that its meaning is deprecated. The value was + implemented as a bitshift on a score generated by the badness() + function that did not have any precise units of measure. With the + rewrite, the score is given as a proportion of available memory to the + task allocating pages, so using a bitshift which grows the score + exponentially is, thus, impossible to tune with fine granularity. + + A much more powerful interface, /proc/<pid>/oom_score_adj, was + introduced with the oom killer rewrite that allows users to increase or + decrease the badness() score linearly. This interface will replace + /proc/<pid>/oom_adj. + + See Documentation/filesystems/proc.txt for information on how to use the + new tunable. + + A ...
On Mon, 15 Feb 2010 14:20:16 -0800 (PST) There are systems that rely on this feature. It's ABI, its sacred. We are committed to it and it has users. That doesn't really detract from the good/bad of the rest of the proposal, it's just one step we can't quite make. Alan --
Andrew suggested that it be deprecated in this way, so that's what was done. I don't have any strong opinions about leaving it around forever now that it's otherwise unused beyond simply converting itself into units for /proc/pid/oom_score_adj at a much higher granularity. --
The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes. There is no guarantee that
current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.
In such situations, it is better to scan the tasklist for nodes that are
allowed to allocate on current's set of nodes and kill the task with the
highest badness() score. This ensures that the most memory-hogging task,
or the one configured by the user with /proc/pid/oom_adj, is always
selected in such scenarios.
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/mempolicy.h | 13 +++++++-
mm/mempolicy.c | 39 +++++++++++++++++++++++
mm/oom_kill.c | 77 +++++++++++++++++++++++++++-----------------
3 files changed, 98 insertions(+), 31 deletions(-)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -202,6 +202,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
unsigned long addr, gfp_t gfp_flags,
struct mempolicy **mpol, nodemask_t **nodemask);
extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+ const nodemask_t *mask);
extern unsigned slab_node(struct mempolicy *policy);
extern enum zone_type policy_zone;
@@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
return node_zonelist(0, gfp_flags);
}
-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
+{
+ return false;
+}
+
+static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+ const nodemask_t *mask)
+{
+ return false;
+}
static inline int do_migrate_pages(struct mm_struct *mm,
const nodemask_t *from_nodes,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- ...Seems reasonable, but I think it will require lots of testing. -- Three Cheers, Balbir --
I already tested it by checking that tasks with very elevated oom_adj values don't get killed when they do not share the same MPOL_BIND nodes as a memory-hogging task. What additional testing did you have in mind? --
Tasks that do not share the same set of allowed nodes with the task that
triggered the oom should not be considered as candidates for oom kill.
Tasks in other cpusets with a disjoint set of mems would be unfairly
penalized otherwise because of oom conditions elsewhere; an extreme
example could unfairly kill all other applications on the system if a
single task in a user's cpuset sets itself to OOM_DISABLE and then uses
more memory than allowed.
Killing tasks outside of current's cpuset rarely would free memory for
current anyway.
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/oom_kill.c | 12 +++---------
1 files changed, 3 insertions(+), 9 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,7 +35,7 @@ static DEFINE_SPINLOCK(zone_scan_lock);
/* #define DEBUG */
/*
- * Is all threads of the target process nodes overlap ours?
+ * Do all threads of the target process overlap our allowed nodes?
*/
static int has_intersects_mems_allowed(struct task_struct *tsk)
{
@@ -167,14 +167,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
points /= 4;
/*
- * If p's nodes don't overlap ours, it may still help to kill p
- * because p may have allocated or otherwise mapped memory on
- * this node before. However it will be less likely.
- */
- if (!has_intersects_mems_allowed(p))
- points /= 8;
-
- /*
* Adjust the score by oom_adj.
*/
if (oom_adj) {
@@ -266,6 +258,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
continue;
if (mem && !task_in_mem_cgroup(p, mem))
continue;
+ if (!has_intersects_mems_allowed(p))
+ continue;
/*
* This task already has access to memory reserves and is
--
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variabl |
