In 2.6.3* kernels (the test case was run on the 2.6.33.3 kernel),
when physical memory runs out and there is a large swap partition,
the system completely stalls.
I noticed this when running Debian lenny using dm-crypt with an
encrypted / and swap on a 2.6.33.3 kernel (and all of the 2.6.3*
series, iirc): when all physical memory is used (swappiness was left at
the default of 60) the system hangs and does not respond. It can resume
normal operation some time later, but it seems to take a *very*
long time for the oom killer to come in. Obviously with swapoff this
doesn't happen: the oom killer comes in and does its job.
free -m
             total       used       free     shared    buffers     cached
Mem:          1980       1101        879          0         58        201
-/+ buffers/cache:        840       1139
Swap:        24943          0      24943
My simple test case is
dd if=/dev/zero of=/tmp/stall
and wait till /tmp fills...
--
Is there a reason no one has taken any interest in my email? The behaviour isn't found on the 2.6.26 Debian kernel, so I was thinking that it might be due to my Intel graphics card / memory interplay?
--
Is that tmpfs sized the default 50% of RAM?

But I wonder if you're suffering from a bug which KOSAKI-San just identified, and has very recently posted this patch: please try it and let us all know - thanks.

Hugh

[PATCH] tmpfs: Insert tmpfs cache pages to inactive list at first

Shaohua Li reported parallel file copy on tmpfs can lead to OOM killer. This is a regression caused by commit 9ff473b9a7 (vmscan: evict streaming IO first). Wow, it is a 2 years old patch!

Currently, tmpfs file cache is inserted into the active list at first. This means the insertion doesn't only increase the number of pages in the anon LRU, but also reduces the anon scanning ratio. Therefore, vmscan gets totally confused. It scans almost only the file LRU even though the system has plenty of unused tmpfs pages.

Historically, lru_cache_add_active_anon() was used for two reasons.
1) Intend to prioritize shmem pages rather than regular file cache.
2) Intend to avoid reclaim priority inversion of used once pages.

But we've lost both motivations because
(1) Now we have separate anon and file LRU lists, so inserting into the active list doesn't help such prioritization.
(2) In the past, one pte access bit would cause page activation, so inserting into the inactive list with the pte access bit set meant higher priority than inserting into the active list. This priority inversion could lead to unintended LRU churn, but it was already solved by commit 645747462 (vmscan: detect mapped file pages used only once). (Thanks Hannes, you are great!)

Thus, now we can use lru_cache_add_anon() instead.

Reported-by: Shaohua Li <shaohua.li@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/filemap.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b941996..023ef61 100644
--- a/mm/filemap.c
+++ ...
That was just a simple test case with dd. That test case might be invalid, but it is trying to trigger out of memory, and doing this any other way still causes the problem. I note that while playing with some BIOS settings I was actually able to trigger what appeared to be graphics corruption issues when I launched KDE applications. Nothing shows up in dmesg, so this might just be a conflict between xorg and the kernel with those BIOS settings.

Anyway, this is no longer a 'problem' for me since I disabled overcommit and altered the values for dirty_ratio and dirty_background_ratio, and I can no longer trigger it.
--
Disabling overcommit should always do it, but I'd be interested to know if restoring dirty_ratio to 40 would help your usecase. --
Actually it turns out that on 2.6.34.1 I can trigger this issue. What it
really is, is that Linux doesn't invoke the oom killer when it should
and kill something off. This is *really* annoying.
I used the following script (on 2.6.34.1):
cat ./scripts/disable_over_commit
#!/bin/bash
echo 2 > /proc/sys/vm/overcommit_memory
echo 40 > /proc/sys/vm/dirty_ratio
echo 5 > /proc/sys/vm/dirty_background_ratio
And I was still able to reproduce this bug.
Here is some C code to trigger the condition I am talking about.
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
	/* Leak 1000-byte blocks forever; the result of malloc() is deliberately ignored. */
	while (1) {
		malloc(1000);
	}
	return 0;
}
--
I'm not exactly sure what you're referring to, it's been two months and you're using a new kernel and now you're saying that the oom killer isn't being utilized when the original problem statement was that it was killing --
Sorry about the timespan :(

Well, actually it is the same issue. Originally the oom killer wasn't being invoked, and the problem now is that it still isn't invoked: it doesn't come and kill things, and my desktop just sits :)

I have since replaced the hard disk, which I thought could be the issue. I am thinking that because I have shared graphics not using KMS, with Intel graphics, this may be the root cause.
--
All things that are, are with more spirit chased than enjoyed.
-- Shakespeare, "Merchant of Venice"
--
Do you mean the issue goes away if you disable Intel graphics? If so, we need help from the Intel graphics driver folks. Sorry, the linux-mm folks don't know the Intel graphics details.
--
Well, the only other system I have running the 2.6.34.1 kernel atm is an ARM-based system. I originally sent this to the kernel list and was told I should probably forward it to the mm list. It may be a general issue or it could just be specific :)
--
"Not Hercules could have knock'd out his brains, for he had none."
-- Shakespeare
--
Hmm.. I'm puzzled 8-) I don't understand why nobody else can reproduce your issue even though your reproducer program is very simple. So I'm guessing there is a hidden condition for reproducing it, but I have no idea how to find it.
--
I will try with the latest Ubuntu and report how that goes (that will be using a fairly new xorg etc.). It is likely to be a hidden issue just with the Intel graphics driver. However, my concern is that it isn't, and that it is about how shared graphics memory is handled :)
--
Ok, my desktop still stalled and no oom killer was invoked when I added swap to a live CD of 10.04 amd64. *Without* *swap* *on* the oom killer was invoked; here is a copy of it.

[ 298.180542] Xorg invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
[ 298.180553] Xorg cpuset=/ mems_allowed=0
[ 298.180560] Pid: 3808, comm: Xorg Not tainted 2.6.32-21-generic #32-Ubuntu
[ 298.180564] Call Trace:
[ 298.180583] [<ffffffff810b37cd>] ? cpuset_print_task_mems_allowed+0x9d/0xb0
[ 298.180595] [<ffffffff810f64f4>] oom_kill_process+0xd4/0x2f0
[ 298.180603] [<ffffffff810f6ab0>] ? select_bad_process+0xd0/0x110
[ 298.180609] [<ffffffff810f6b48>] __out_of_memory+0x58/0xc0
[ 298.180616] [<ffffffff810f6cde>] out_of_memory+0x12e/0x1a0
[ 298.180626] [<ffffffff81540c9e>] ? _spin_lock+0xe/0x20
[ 298.180633] [<ffffffff810f9d21>] __alloc_pages_slowpath+0x511/0x580
[ 298.180641] [<ffffffff810f9eee>] __alloc_pages_nodemask+0x15e/0x1a0
[ 298.180650] [<ffffffff8112ca57>] alloc_pages_current+0x87/0xd0
[ 298.180657] [<ffffffff810f8e0e>] __get_free_pages+0xe/0x50
[ 298.180666] [<ffffffff81154994>] __pollwait+0xb4/0xf0
[ 298.180673] [<ffffffff814e09a5>] unix_poll+0x25/0xc0
[ 298.180682] [<ffffffff81449bea>] sock_poll+0x1a/0x20
[ 298.180688] [<ffffffff811545b2>] do_select+0x3a2/0x6d0
[ 298.180696] [<ffffffff811548e0>] ? __pollwait+0x0/0xf0
[ 298.180702] [<ffffffff811549d0>] ? pollwake+0x0/0x60
[ 298.180708] [<ffffffff811549d0>] ? pollwake+0x0/0x60
[ 298.180714] [<ffffffff811549d0>] ? pollwake+0x0/0x60
[ 298.180721] [<ffffffff811549d0>] ? pollwake+0x0/0x60
[ 298.180727] [<ffffffff811549d0>] ? pollwake+0x0/0x60
[ 298.180732] [<ffffffff811549d0>] ? pollwake+0x0/0x60
[ 298.180737] [<ffffffff811549d0>] ? pollwake+0x0/0x60
[ 298.180741] [<ffffffff811549d0>] ? pollwake+0x0/0x60
[ 298.180745] [<ffffffff811549d0>] ? pollwake+0x0/0x60
[ 298.180749] [<ffffffff811550ba>] core_sys_select+0x18a/0x2c0
[ 298.180777] [<ffffffffa001eced>] ? drm_ioctl+0x13d/0x480
...
This stack looks similar to the following bug. Can you please try disabling the Intel graphics driver?
--
Ok, I am not sure how to do that :) I could revert the patch and see if it 'fixes' this :)
--
Oops, no, reverting is not a good action; the patch is correct. Probably my explanation was not clear, sorry. I was hoping you would disable the 'driver' (i.e. use vga), not disable the patch. Thanks.
--
Oh, you mean in xorg; I will also blacklist the module. Sure, that patch might not be it, but in 2.6.26 the problem isn't there :)
--
Ok, I re-tested with 2.6.26 and 2.6.34.1, so I will describe what happens below.

2.6.26 - with xorg running

"Given I have a test file called a.out
And I can see Xorg
And I am using 2.6.26
And I have swap on
When I run a.out
Then I see the system freeze up slightly
And the hard drive churns (and the cpu is doing something, as the large fan kicks in)
And after a while the system unfreezes"

2.6.26 - from single mode - before xorg starts and i915 is *not* loaded.

"Given I have a test file called a.out
And I cannot see Xorg
And I am using 2.6.26
And I have swap on
When I run a.out
Then I see the system freeze up
And the system fan doesn't spin any faster
And the system just sits idle"

2.6.34.1 - with and without xorg - WITH swap on, the same behaviour as in the 2.6.26 kernel appears (when xorg is not loaded).

OOM attached from the 2.6.26 kernel, when I used magic keys to invoke the oom killer :) (this was on the 2.6.26 kernel - before i915 had loaded and in single mode).

[ 280.323899] SysRq : Manual OOM execution
[ 280.324009] events/0 invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
[ 280.324056] Pid: 9, comm: events/0 Not tainted 2.6.26-2-amd64 #1
[ 280.324098]
[ 280.324099] Call Trace:
[ 280.324200] [<ffffffff8027388c>] oom_kill_process+0x57/0x1dc
[ 280.324247] [<ffffffff8023b49d>] __capable+0x9/0x1c
[ 280.324290] [<ffffffff80273bb7>] badness+0x188/0x1c7
[ 280.324341] [<ffffffff80273deb>] out_of_memory+0x1f5/0x28e
[ 280.324396] [<ffffffff8037824c>] moom_callback+0x0/0x1a
[ 280.324449] [<ffffffff80243070>] run_workqueue+0x82/0x111
[ 280.324497] [<ffffffff8024393d>] worker_thread+0xd5/0xe0
[ 280.324543] [<ffffffff80246171>] autoremove_wake_function+0x0/0x2e
[ 280.324596] [<ffffffff80243868>] worker_thread+0x0/0xe0
[ 280.324637] [<ffffffff8024604b>] kthread+0x47/0x74
[ 280.324678] [<ffffffff802300ed>] schedule_tail+0x27/0x5c
[ 280.326721] [<ffffffff8020cf38>] child_rip+0xa/0x12
[ 280.326788] ...
Ok, this issue is still around and still *really* annoying.

I had a 5 MB text file and ran %s/\n/, in vim; my desktop stalls as vim uses memory, and it sits there for ~10 minutes before the oom killer finally wakes up and does something.

This is on totally different hardware now (AMD Phenom, DDR3 RAM, SATA 3 disk), and here is some dmesg output :)

Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956367] kjournald D ffff88011be59a00 0 982 2 0x00000000
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956370] ffff88011bf9fbf0 0000000000000046 ffff88011bf9fbc0 ffffffffa00f0775
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956373] ffff88011bf9ffd8 0000000000013900 ffff88011bf9ffd8 ffff88011be59680
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956375] 0000000000013900 0000000000013900 0000000000013900 0000000000013900
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956377] Call Trace:
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956399] [<ffffffffa00f0775>] ? dm_table_unplug_all+0x54/0xc6 [dm_mod]
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956405] [<ffffffff812e4f80>] io_schedule+0x7b/0xc1
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956408] [<ffffffff8110d0ea>] sync_buffer+0x3b/0x3f
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956409] [<ffffffff812e5488>] __wait_on_bit+0x47/0x79
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956411] [<ffffffff8110d0af>] ? sync_buffer+0x0/0x3f
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956413] [<ffffffff8110d0af>] ? sync_buffer+0x0/0x3f
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956415] [<ffffffff812e5524>] out_of_line_wait_on_bit+0x6a/0x77
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956418] [<ffffffff8105b678>] ? wake_bit_function+0x0/0x2a
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956419] [<ffffffff8110d06f>] __wait_on_buffer+0x1f/0x21
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956425] [<ffffffffa0165824>] journal_commit_transaction+0xa42/0xfba [jbd]
Sep 21 22:41:44 RANDOMBOXEN kernel: [329160.956427] [<ffffffff812e4e36>] ? ...
