Re: [PATCH] Revert oom rewrite series

From: KOSAKI Motohiro
Date: Saturday, November 13, 2010 - 10:07 pm

Linus,

Please apply this. This patch reverts the oom changes committed since v2.6.35.

Briefly, "oom: badness heuristic rewrite" was merged by mistake.
It has passed neither our design review nor our code review, and multiple bug
reports have popped up since. I believe every patch should pass a use-case
review and a code review :-/

The problem is that DavidR's patches don't reflect real-world use cases at all
and break them. He can argue that userland is wrong, but such an
excuse doesn't solve the real-world issue. It makes no sense.

I hope every developer keeps developing honestly. Googlers are NOT an
exception.


David, at least the rss-based oom score passed our design review.
So if you resubmit that part, we will ack it. Please remember that.
Also, I can accept the oom_score_adj feature if you can remove the
incompatibility issue. OK?

Linus, if you want to check the patch, please use the following command:
  % git diff a63d83f427fbce97a6cea0db2e64b0eb8435cd10^ mm/oom_kill.c include/linux/oom.h fs/proc/base.c


Thanks.

--------------------------------------------------------------------------
Subject: [PATCH] Revert oom rewrite series

This reverts the following commits. They broke an ABI and caused multiple
end-user complaints.

9c28ab662a8e3d19d07077ac0a8931c015e8afec Revert "oom: badness heuristic rewrite"
74cd8c6cb3e093c4d67ac3eb3581e246e4981dad Revert "oom: deprecate oom_adj tunable"
79a0bd5796e754c4b4e22071c4edddef3517d010 Revert "memcg: use find_lock_task_mm() in memory cgroups oom"
a465ef80c2a9fe73c85029fcea5c68ffee8dbb69 Revert "oom: always return a badness score of non-zero for eligible tas
516fcbb0c45d943df1b739d3be3d417aee2275f3 Revert "oom: filter unkillable tasks from tasklist dump"
b1c98f95a7954c450dadd809280f86863ea9d05d Revert "oom: add per-mm oom disable count"
fd79f3f47c82a0af5288afe7556905dd171bfc43 Revert "oom: avoid killing a task if a thread sharing its mm cannot be
2d72175528870dcef577db4a2a0b49d819c6eaff Revert "oom: kill all threads sharing oom killed task's ...
From: Linus Torvalds
Date: Sunday, November 14, 2010 - 12:32 pm

I'm not getting involved in this whole flame-war. You need to convince
Andrew, who has been the person everything went through.

                    Linus
--

From: KOSAKI Motohiro
Date: Sunday, November 14, 2010 - 5:54 pm

I wonder why he keeps so silent. But _I_ strongly don't want to ignore bug reports and
userland complaints. I hope to fix any bug as far as my development time allows.



--

From: Andrew Morton
Date: Sunday, November 14, 2010 - 7:19 pm

Nothing to say, really.  Seems each time we're told about a bug or a
regression, David either fixes the bug or points out why it wasn't a
bug or why it wasn't a regression or how it was a deliberate behaviour
change for the better.

I just haven't seen any solid reason to be concerned about the state of
the current oom-killer, sorry.

I'm concerned that you're concerned!  A lot.  When someone such as
yourself is unhappy with part of MM then I sit up and pay attention. 
But after all this time I simply don't understand the technical issues
which you're seeing here.
--

From: Figo.zhang
Date: Sunday, November 14, 2010 - 9:41 pm

> Nothing to say, really.  Seems each time we're told about a bug or a
> regression, David either fixes the bug or points out why it wasn't a
> bug or why it wasn't a regression or how it was a deliberate behaviour
> change for the better.

> I just haven't seen any solid reason to be concerned about the state of
> the current oom-killer, sorry.

> I'm concerned that you're concerned!  A lot.  When someone such as
> yourself is unhappy with part of MM then I sit up and pay attention.
> But after all this time I simply don't understand the technical issues
> which you're seeing here.

We are just talking about oom-killer technical issues.

I have doubts about the new rewrite because the author cannot provide any
evidence or experimental results. Why did you do that? What is the prominent
change in your new algorithm?

As KOSAKI Motohiro said, "you removed CAP_SYS_RESOURCE condition with
ZERO explanation".

David just said to use the userspace tunable oom_score_adj for
protection. But may I ask some questions:


1. What is the innovation in your new algorithm? The old one offered the
same kind of user tunable, oom_adj.

2. If a server such as a db-server or financial-server has many important
processes (such as root or hardware-access processes) that need protection,
you leave it to the administrator to find out which processes should be
protected. You will drive the financial-server administrator crazy!! And
lose so much money!! ^~^

3. I saw your email on LKML, where you just said
"I have repeatedly said that the oom killer no longer kills KDE when run
on my desktop in the presence of a memory hogging task that was written
specifically to oom the machine."
http://thread.gmane.org/gmane.linux.kernel.mm/48998


So you only tested your new oom_killer algorithm on your desktop with KDE.
Have you provided the details of how you did the test? Can anyone repeat the
experiment and get the same result as you describe?


as KOSAKI Motohiro said, in reality word, it we makes 5-6 brain
simulation, embedded, ...
From: KOSAKI Motohiro
Date: Sunday, November 14, 2010 - 11:57 pm

Of course, I disagreed. He seems to think the number of emails matters more than
what they say, but that's incorrect and makes no sense. Why not? Also, he
has to argue logically. "Hey, I think it's not a bug" makes no sense.
Such a claim doesn't solve anything. Userland is still unhappy. Why not?
I want quick action.

I would like to suggest that they join and contribute to a distro kernel
maintenance team. Many community-based distributions welcome developers.
And bugfix work teaches them a lot: which use cases are frequently used,
which bug reports are frequently raised, etc.

That said, if anyone wants to change a userland ABI, be careful. They have
to investigate userland use cases carefully and carefully avoid breaking
them. If someone thinks "hey, it's no big matter, a userland rewrite can solve
the issue", I strongly disagree. they don't understand why all of userland 


You should have read my patch descriptions which I sent and my e-mail.


1) About two months ago, Dave Hansen observed a strange OOM issue because he
   has a big machine and ALL of its processes are fairly small. Thus, eventually
   all processes got oom-score=0 and the oom-killer didn't work.

   https://kerneltrap.org/mailarchive/linux-driver-devel/2010/9/9/6886383

   DavidR changed the oom-score to +1 in that situation.

   http://kerneltrap.org/mailarchive/linux-kernel/2010/9/9/4617455

   But that is completely bogus. If all processes have score=1, the oom-killer
   falls back to being a purely random killer. I anticipated and explained this
   problem with his patch half a year ago, but he hasn't fixed it yet.

2) Also half a year ago, I explained that oom_adj is used by multiple
   applications and that we can't break them. But DavidR didn't fix it.

3) Also about four months ago, kamezawa-san and I pointed out that his patch
   doesn't work on memcg. That hasn't been fixed either.


On the other hand, you can't explain what value the OOM rewrite patch has,
because there is none. It is only "powerful"(TM) for Google. but 
instead It has zero ...
From: David Rientjes
Date: Monday, November 15, 2010 - 3:34 am

If there are pending complaints or bugs that I haven't addressed, please 
bring them to my attention.  To date, I know of no issues that have been 
raised that I have not addressed; you're always free to disagree with my 
position, but in the end you may find that when the kernel moves in a 

You may remember that the initial version of my rewrite replaced oom_adj 
entirely with the new oom_score_adj semantics.  Others suggested that it 
be separated into a new tunable and the old tunable deprecated for a 
lengthy period of time.  I accepted that criticism and understood the 
drawbacks of replacing the tunable immediately and followed those 
suggestions.  I disagree with you that the deprecation of oom_adj for a 
period of two years is as dramatic as you imply and I disagree that users 
are experiencing problems with the linear scale that it now operates on 

The resolution with which the oom killer considers memory is at 0.1% of 
system RAM at its highest (smaller when you have a memory controller, 
cpuset, or mempolicy constrained oom).  It considers a task within 0.1% of 
memory of another task to have equal "badness" to kill, we don't break 
ties in between that resolution -- it all depends on which one shows up in 
the tasklist first.  If you disagree with that resolution, which I support 
as being high enough, then you may certainly propose a patch to make it 
even finer at 0.01%, 0.001%, etc.  It would only change oom_badness() to 

And we didn't.  oom_adj is still there and maps linearly to oom_score_adj; 
you just can't show a single application where that mapping breaks because 
it was based on an actual calculation.
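
The linear mapping described above can be sketched as follows. This is a minimal illustration, not the kernel source: the constant values and the special case for the maximum are written from memory of the 2.6.36-era fs/proc/base.c and include/linux/oom.h and should be treated as assumptions, and oom_adj_to_score_adj() is a hypothetical helper name.

```c
#include <assert.h>

/* Assumed values, recalled from include/linux/oom.h of that era. */
#define OOM_DISABLE        (-17)
#define OOM_ADJUST_MAX     15
#define OOM_SCORE_ADJ_MAX  1000

/*
 * Hypothetical helper: scale an old-style oom_adj value (-17..15)
 * linearly onto the oom_score_adj range (-1000..1000).
 */
static int oom_adj_to_score_adj(int oom_adj)
{
    assert(oom_adj >= OOM_DISABLE && oom_adj <= OOM_ADJUST_MAX);

    /* The top of the old range is pinned to the top of the new range. */
    if (oom_adj == OOM_ADJUST_MAX)
        return OOM_SCORE_ADJ_MAX;

    /* Integer division truncates toward zero. */
    return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}
```

Under these assumptions, an oom_adj of -17 maps to -1000 (never kill), 0 maps to 0, and 8 maps to 470, i.e. a bias worth 47% of available memory rather than the old factor of 256.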

If you would like to cite these "multiple" applications that need to be 
converted to use oom_score_adj (I know of udev), please let me know and 
if they're open-source applications then I will commit to submitting 
patches for them myself.  I believe the two year window is sufficient for 


Please see my reply to Figo.zhang where I enumerate the four ...
From: Jesper Juhl
Date: Monday, November 15, 2010 - 4:31 pm

On Mon, 15 Nov 2010, David Rientjes wrote:

[...]

I'm not going into the debate about whether or not deprecating one tunable 
for two years is sufficient or not. I'm simply going to mention one app 
that I know of that needs to be converted to use "oom_score_adj" on my 
box :

[jj@dragon ~]$ uname -a
Linux dragon 2.6.37-rc1-ARCH-00542-g0143832-dirty #1 SMP PREEMPT Mon Nov 15 22:01:52 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux
[jj@dragon ~]$ dmesg | grep oom_adj
start_kdeinit (1502): /proc/1502/oom_adj is deprecated, please use /proc/1502/oom_score_adj instead.
[jj@dragon ~]$ /usr/lib/kde4/libexec/start_kdeinit --version

Qt: 4.7.1
KDE: 4.5.3 (KDE 4.5.3)



-- 
Jesper Juhl <jj@chaosbits.net>            http://www.chaosbits.net/
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.

--

From: David Rientjes
Date: Monday, November 15, 2010 - 5:06 pm

Thanks for the report!  I'll get involved with kde-devel and send a patch 
to remove this dependency on newer kernels to expedite the process.

 [ Others with reports of deprecated use of oom_adj can contact me 
   privately and I'll find the parties of interest to avoid topics 
   unrelated to the kernel itself on LKML. ]
--

From: Martin Knoblauch
Date: Tuesday, November 16, 2010 - 3:04 am

CC trimmed for sanity ...

David,

 another one for your collection. You asked for it :-) This is CentOS-5.5 
running on top of kernel 2.6.36, likely out of initrd:

$ dmesg | grep deprecated
[    2.430330] nash-hotplug (67): /proc/67/oom_adj is deprecated, please use /proc/67/oom_score_adj instead.

Cheers
Martin
--

From: Alessandro Suardi
Date: Tuesday, November 16, 2010 - 3:33 am

On Tue, Nov 16, 2010 at 11:04 AM, Martin Knoblauch

...and another, on Fedora 14, 2.6.37-rc1-git11:

auditd (2583): /proc/2583/oom_adj is deprecated, please use /proc/2583/oom_score_adj instead.

Cheers,

--alessandro

 "There's always a siren singing you to shipwreck"

   (Radiohead, "There There")
--

From: Valdis.Kletnieks
Date: Monday, November 15, 2010 - 5:13 pm


Make that 2 common apps:

% uname -a
Linux turing-police.cc.vt.edu 2.6.37-rc1-mmotm1109 #1 SMP PREEMPT Wed Nov 10 12:30:17 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
% dmesg | grep oom
[   89.981594] sshd (4168): /proc/4168/oom_adj is deprecated, please use /proc/4168/oom_score_adj instead.
% rpm -q openssh
openssh-5.6p1-16.fc15.x86_64

5.6p1 is the latest-n-greatest released version on www.openssh.org, so somebody
probably needs to rattle their chain...

From: David Rientjes
Date: Monday, November 15, 2010 - 11:43 pm

Thanks, Darren Tucker fixed this a few hours after I reported it on the 
openssh bugzilla, the patch is at 
https://bugzilla.mindrot.org/show_bug.cgi?id=1838 -- it uses oom_score_adj 
if it exists and then falls back to oom_adj if running on an older kernel.
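
The "new tunable first, old tunable as a fallback" approach described here can be sketched in a few lines of C. This is not the OpenSSH patch itself: set_oom_protection() and write_tunable() are hypothetical names, and the directory is passed in as a parameter (normally "/proc/self") only so the sketch can be exercised outside a real procfs.

```c
#include <stdio.h>

/* Write one integer to <proc_dir>/<name>; return 0 on success, -1 on failure. */
static int write_tunable(const char *proc_dir, const char *name, int value)
{
    char path[256];
    FILE *f;
    int n, ok;

    n = snprintf(path, sizeof(path), "%s/%s", proc_dir, name);
    if (n < 0 || n >= (int)sizeof(path))
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;              /* e.g. the file does not exist on this kernel */

    ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0)         /* a rejected write may only surface on flush */
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Prefer oom_score_adj; fall back to oom_adj on older kernels.
 * Returns 1 if oom_score_adj was used, 0 if oom_adj was used,
 * and -1 if neither could be written.
 */
static int set_oom_protection(const char *proc_dir, int score_adj, int adj)
{
    if (write_tunable(proc_dir, "oom_score_adj", score_adj) == 0)
        return 1;
    if (write_tunable(proc_dir, "oom_adj", adj) == 0)
        return 0;
    return -1;
}
```

A daemon that wants to exempt itself might call set_oom_protection("/proc/self", -1000, -17). Lowering the value needs sufficient privilege, so a caller should treat a -1 return as non-fatal.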
--

From: Alan Cox
Date: Tuesday, November 16, 2010 - 4:03 am

But current openssh needs to support old kernels.

This is why this kind of obsoleting doesn't work well. It's not "update
your app" so much as "drop support for older stuff or start doing
complicated crap dependent on version"

and it's why for tiny amounts of code it is the *wrong* thing to force
obsolete stuff especially when it still doesn't seem to have been
properly marked for deprecation in the first place.

--

From: Florian Mickler
Date: Tuesday, November 16, 2010 - 6:03 am

On Tue, 16 Nov 2010 11:03:10 +0000

How does one mark it appropriately?
The commit 51b1bd2 (oom: deprecate oom_adj tunable, see below) 
added it to feature-removal-schedule.txt, a patch for
Documentation/ABI has also been provided in the meantime, if i'm not
mistaken. 

And there is already a patch for openssh:
https://bugzilla.mindrot.org/show_bug.cgi?id=1838

Regards,
Flo

commit 51b1bd2ace1595b72956224deda349efa880b693
Author: David Rientjes <rientjes@google.com>
Date:   Mon Aug 9 17:19:47 2010 -0700

    oom: deprecate oom_adj tunable
    
    /proc/pid/oom_adj is now deprecated so that that it may eventually be
    removed.  The target date for removal is August 2012.
    
    A warning will be printed to the kernel log if a task attempts to use this
    interface.  Future warning will be suppressed until the kernel is rebooted
    to prevent spamming the kernel log.
    
    Signed-off-by: David Rientjes <rientjes@google.com>
    Cc: Nick Piggin <npiggin@suse.de>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Balbir Singh <balbir@in.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
--

From: Alan Cox
Date: Tuesday, November 16, 2010 - 7:55 am

Yes - so why is it spewing crap, annoying users and trying to irritate
application authors. It's not 2012 yet.

--

From: David Rientjes
Date: Tuesday, November 16, 2010 - 1:57 pm

It's a WARN_ON_ONCE() so it will only spew a single line as a reminder 
that the application needs to be updated; would you prefer that to be 
suppressed until a year before removal, for example?
--

From: Fabio Comolli
Date: Tuesday, November 16, 2010 - 2:01 pm

[CC: list trimmed again for sanity]

Another one:

[   34.709156] chromium-browse (1439): /proc/1480/oom_adj is deprecated, please use /proc/1480/oom_score_adj instead.

2.6.37-rc2 - archlinux - package chromium-browser-ppa from AUR




--

From: Valdis.Kletnieks
Date: Tuesday, November 16, 2010 - 9:04 pm

Aug 2012 is only 6 kernel releases or so away....

Presumably the whinging is so we start tracking down the offending userspace
and getting it fixed before 2012 gets here.  Sticking the warning in just one
or two kernel releases before it becomes official leads to "I can't run the new
kernel because my userspace isn't patched yet".  We really can't win here,
we don't whinge and stuff doesn't get tracked down and fixed, we do whinge
and that gets people upset too.
From: Alejandro Riveira Fernández
Date: Tuesday, November 16, 2010 - 8:15 am

El Mon, 15 Nov 2010 19:13:15 -0500

$ dmesg | grep deprecated
[    1.473365] udevd (662): /proc/662/oom_adj is deprecated, please use /proc/662/oom_score_adj instead.
$ apt-cache policy udev
udev:
  Instalados: 151-12.3
  Candidato: 151-12.3
  Tabla de versión:
 *** 151-12.3 0
        500 http://es.archive.ubuntu.com/ubuntu/ lucid-proposed/main Packages
        100 /var/lib/dpkg/status
     151-12.2 0
        500 http://es.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
     151-12 0
        500 http://es.archive.ubuntu.com/ubuntu/ lucid/main Packages
$ uname -a
Linux varda 2.6.36-00001-g90d39e9 #145 SMP PREEMPT Wed Oct 20 23:27:44 CEST 2010 x86_64 GNU/Linux

From: KOSAKI Motohiro
Date: Tuesday, November 23, 2010 - 12:16 am

I can't understand. Why do I need to ignore userland folks? WHY?
I have no reason userland complain. I tend to prefer to avoid userland 


No.
Think of Moore's Law. A ratio-based value will not work in the future anyway.
10 years ago, I used a desktop machine with 20 MB of memory, and now I'm using 2 GB.

If you want, you have to change userland at first and by yourself. Don't


I'm NOT interested in *powerful* crap. Please DON'T talk about which is more powerful.



--

From: David Rientjes
Date: Saturday, November 27, 2010 - 6:45 pm

If you'd like to suggest an increase to the upper-bound of the badness 
score, please do so, although I don't think we need to break ties amongst 
tasks that differ by at most <0.1% of the system's capacity.
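
To make the 0.1% resolution concrete, here is a minimal sketch of a proportional score on a 0..1000 scale. It is a simplification under stated assumptions: badness_points() is a hypothetical name, it ignores swap, page tables, and oom_score_adj, and it is not the kernel's oom_badness().

```c
#include <assert.h>

/*
 * Hypothetical helper: express a task's memory use as thousandths of the
 * memory available in the oom context. Each point is 0.1% of that memory.
 */
static unsigned long long badness_points(unsigned long long task_pages,
                                         unsigned long long total_pages)
{
    assert(total_pages > 0);
    return task_pages * 1000ULL / total_pages;
}
```

On a 2 GB machine with 4 KiB pages (524288 pages), tasks using 10000 and 10400 pages both score 19, so the tie is decided by tasklist order, while a task using only 100 pages scores 0.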
--

From: KOSAKI Motohiro
Date: Tuesday, November 30, 2010 - 6:04 am

Why?

No. I dislike it. I dislike a proportional score.

--

From: David Rientjes
Date: Tuesday, November 30, 2010 - 1:02 pm

There have never been any bug reports related to applications using 
oom_score_adj and being impacted with its linear mapping onto oom_adj's 
exponential scale.  That's because no users prior to the rewrite were 
using oom_adj scores that were based on either the expected memory usage 
of the application or the capacity of the machine.
--

From: KOSAKI Motohiro
Date: Tuesday, November 23, 2010 - 12:16 am

Zero questions?  If so, I'll resend the revert to Linus.


Actually, I don't tend to listen to shouting. It isn't discussion. It's
only crappy shouting. Googlers have to think about why nobody agrees with
their claim. ZERO, even though >20 people discussed it with them. DavidR seems
to keep flaming. But I don't care. He has to learn that flaming doesn't solve
ANYTHING.

And they have to learn the correct way to discuss, and the difference between
discussion and shouting, and why we have to learn userland workloads and
have to avoid any breakage. I'm angry that Googlers frequently break the kernel
and frequently ignore userland complaints.



--

From: KOSAKI Motohiro
Date: Tuesday, November 23, 2010 - 4:51 pm

I did.

Therefore, I will resend the patch to you. Thanks.


--------------------------------------------------------------------------
Subject: [PATCH] Revert oom rewrite series

This reverts the following commits. They broke an ABI and caused multiple
end-user complaints.

9c28ab662a8e3d19d07077ac0a8931c015e8afec Revert "oom: badness heuristic rewrite"
74cd8c6cb3e093c4d67ac3eb3581e246e4981dad Revert "oom: deprecate oom_adj tunable"
79a0bd5796e754c4b4e22071c4edddef3517d010 Revert "memcg: use find_lock_task_mm() in memory cgroups oom"
a465ef80c2a9fe73c85029fcea5c68ffee8dbb69 Revert "oom: always return a badness score of non-zero for eligible tas
516fcbb0c45d943df1b739d3be3d417aee2275f3 Revert "oom: filter unkillable tasks from tasklist dump"
b1c98f95a7954c450dadd809280f86863ea9d05d Revert "oom: add per-mm oom disable count"
fd79f3f47c82a0af5288afe7556905dd171bfc43 Revert "oom: avoid killing a task if a thread sharing its mm cannot be
2d72175528870dcef577db4a2a0b49d819c6eaff Revert "oom: kill all threads sharing oom killed task's mm"
be212960618ddcdb9526ce2cb73fd081fd3e90ea Revert "oom: rewrite error handling for oom_adj and oom_score_adj tunab
1b17c41599c594c7d11ef415a92d47c205fe89ea Revert "oom: fix locking for oom_adj and oom_score_adj"

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 Documentation/feature-removal-schedule.txt |   25 ---
 Documentation/filesystems/proc.txt         |   97 ++++-----
 fs/exec.c                                  |    5 -
 fs/proc/base.c                             |  176 ++--------------
 include/linux/memcontrol.h                 |    8 -
 include/linux/mm_types.h                   |    2 -
 include/linux/oom.h                        |   19 +--
 include/linux/sched.h                      |    3 +-
 kernel/exit.c                              |    3 -
 kernel/fork.c                              |   16 +--
 mm/memcontrol.c                            |   28 +---
 mm/oom_kill.c                              |  323 ...
From: David Rientjes
Date: Sunday, November 14, 2010 - 2:58 pm

That's inaccurate, there haven't been multiple bug reports popping up 
since the rewrite; in fact, there hasn't been a single bug report.

There have been two changes to the oom killer since the rewrite:

 - we now kill all threads sharing the oom killed task that share the ->mm 
   since we can't free any memory without them exiting as well, and

 - we count threads that are immune from oom kill attached to an ->mm so 
   we can avoid needlessly killing tasks that aren't immune themselves but 
   have other threads sharing the ->mm that are.

Both of those changes were needed in the old oom killer as well, they have 
nothing to do with the rewrite.

Also, stating that the new heuristic doesn't address CAP_SYS_RESOURCE 
appropriately isn't a bug report, it's the desired behavior.  I eliminated 
all of the arbitrary heuristics in the old heuristic that we had to 
remove internally as well so that it is as predictable as possible and achieves 
the oom killer's sole goal: to kill the most memory-hogging task that is 
eligible to allow memory allocations in the current context to succeed.  
CAP_SYS_RESOURCE threads have full control over their oom killing priority 
by /proc/pid/oom_score_adj and need no consideration in the heuristic by 
default since it otherwise allows for the probability that multiple tasks 
will need to be killed when a CAP_SYS_RESOURCE thread uses an egregious 

As mentioned just a few minutes ago in another thread, there is no 
userspace breakage with the rewrite and you're only complaining here about 
the deprecation of /proc/pid/oom_adj for a period of two years.  Until 
it's removed in 2012 or later, it maps to the linear scale that 
oom_score_adj uses rather than its old exponential scale that was 
unusable for prioritization because of (1) the extremely low resolution, 
and (2) the arbitrary heuristics that preceded it.

You've proposed various forms of your revert (this is the fifth one) and 
I've responded in a very respectful and technical way each ...
From: Bodo Eggert
Date: Monday, November 15, 2010 - 4:33 pm

, but unless they are written in the last months and designed for linux
and if the author took some time to research each external process 
invocation, they can not be aware of this possibility.

Besides that, if each process is supposed to change the default, the 

If it happens to use an egregious amount of memory, it SHOULD score

1) The exponential scale did have a low resolution.

2) The heuristics were developed using much brain power and much
    trial-and-error. You are going back to basics, and some people
    are not convinced that this is better. I googled and I did not
    find a discussion about how and why the new score was designed
    this way.
    looking at the output of:
    cd /proc; for a in [0-9]*; do
      echo `cat $a/oom_score` $a `perl -pes/'\0.*$'// < $a/cmdline`;
    done|grep -v ^0|sort -n |less
    , I'm not convinced, either.

PS) Mapping an exponential value to a linear score is bad. E.g. an
     oom_adj of 8 should make a 1-MB process as likely to be killed as
     a 256-MB process with oom_adj=0.

PS2) Because I saw this in your presentation PDF: (@udev-people)
     The -17 score of udevd is wrong, since it will even prevent
     the OOM killer from working correctly if it grows to 100 MB:

     Its default OOM score is 13, while root's shell is at 190
     and some KDE processes are at 200 000. It will not get killed
     under normal circumstances.

     If udevd grows enough to score 190 as well, it has a bug
     that causes it to eat memory and it needs to be killed. Having
     a -17 oom_adj, it will cause the system to fail instead.
     Considering udevd's size, an adj of -1 or -2 should be enough on
     embedded systems, while desktop systems should not need it.
     If you are worried about udevd getting killed, protect it using
     a wrapper.
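
The PS above about the exponential scale can be illustrated with a short sketch of how the pre-rewrite heuristic applied oom_adj as a bit shift. This is simplified from memory of the old mm/oom_kill.c, so treat the details as assumptions; apply_old_oom_adj() is a hypothetical name.

```c
#include <assert.h>

/*
 * Hypothetical helper: apply an old-style oom_adj to a badness value.
 * Positive values double the score once per step; negative values halve it.
 */
static unsigned long apply_old_oom_adj(unsigned long points, int oom_adj)
{
    /* -17 (OOM_DISABLE) exempted the task entirely and is not handled here. */
    assert(oom_adj > -17 && oom_adj <= 15);

    if (oom_adj > 0) {
        if (points == 0)
            points = 1;
        return points << oom_adj;
    }
    return points >> -oom_adj;
}
```

With points counted in megabytes for simplicity, a 1 MB task with oom_adj=8 and a 256 MB task with oom_adj=0 both end up at 256, which is the equivalence the PS describes.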
--

From: David Rientjes
Date: Monday, November 15, 2010 - 4:50 pm

You're clearly wrong, CAP_SYS_RESOURCE has been required to modify oom_adj 
for over five years (as long as the git history).  8fb4fc68, merged into 
2.6.20, allowed tasks to raise their own oom_adj but not decrease it.  

That doesn't make any sense, if you want to protect a thread from the oom 
killer you're going to need to modify oom_score_adj, the kernel can't know 
what you perceive as being vital.  Having CAP_SYS_RESOURCE alone does not 
imply that, it only allows unbounded access to resources.  That's 
completely orthogonal to the goal of the oom killer heuristic, which is to 

The old heuristics were a mixture of arbitrary values that didn't adjust 
scores based on a unit and would often cause the incorrect task to be 
targeted because there was no clear goal being achieved.  The new 
heuristic has a solid goal: to identify and kill the most memory-hogging 
task that is eligible given the context in which the oom occurs.  If you 
disagree with that goal and want any of the old heuristics reintroduced, 

To show that, you would have to show that an application that exists today 
uses an oom_adj for something other than polarization and is based on a 

Threads with CAP_SYS_RESOURCE are free to lower the oom_score_adj of any 
thread they deem fit and that includes applications that lower their own 
oom_score_adj.  The kernel isn't going to prohibit users from setting 
their own oom_score_adj.
--

From: Bodo Eggert
Date: Tuesday, November 16, 2010 - 5:06 pm

You are misunderstanding me. It was allowed to do this, but it did not need 
to do it yet. It was enough to be a well-written POSIX application without 

The old oom killer's task was to guess the best victim to kill. For me, it 
did a good job (but the system kept thrashing for too long until it kicked
the offender). Looking at CAP_SYS_RESOURCE was one way to recognize 

The first old OOM killer did the same as you promise the current one does,
except for your bugfixes. That's why it killed the wrong applications and
all the heuristics were added until the complaints stopped.

Of course I did not yet test your OOM killer, maybe it really is better.
Heuristics tend to rot and you did much work to make it right.

I don't want the old OOM killer back, but I don't want you to fall

No such application should exist because the OOM killer should DTRT.
oom_adj was supposed to let the sysadmin lower his mission-critical
DB's score to be just lower than the less-important tasks, or to

My point is: The udev people should not prevent the OOM killer 
unconditionally, it has an important task in case something goes wrong.
I just didn't want to start a new thread at that time of day.
-- 
How do I set my laser printer on stun?
--

From: David Rientjes
Date: Tuesday, November 16, 2010 - 5:25 pm

CAP_SYS_RESOURCE does not imply the task is important.

There's a problem when the kernel is oom; killing a thread that is getting 
work done is one of the most serious remedies the kernel will ever do to 
allow forward progress.  In almost all scenarios (except in some cpuset or 
memcg configurations), it's a userspace configuration issue that exhausts 
memory and the VM finds no other alternative.  CAP_SYS_RESOURCE threads 
have access to unbounded amounts of resources and thus can use an 
extremely large amount of memory very quickly and at a detriment to other 
threads that may be as important or more important.  Considering them any 
different is an unsubstantiated and undefined behavior that should not be 
considered in the heuristic _unless_ the administrator or the task itself 

No, the old oom killer did not always kill the application that used the 
most amount of memory; it considered other factors with arbitrary point 
deductions such as nice level, runtime, CAP_SYS_RAWIO, CAP_SYS_RESOURCE, 
etc.  We had to remove those heuristics internally in older kernels as 
well because it would often allow a task to run away using a massive amount 
of memory because of leaks and kill everything else on the system before 
targeting the appropriate task.  At that point, it left the system with 

Thanks, and that's why I'm trying to avoid additional heuristics such as 
CAP_SYS_RESOURCE where the priority is _implied_ rather than _proven_.  If 
CAP_SYS_RESOURCE was defined to be more preferred to stay alive, then I'd 

oom_score_adj allows us to define when an application is using more 
memory than expected and is often helpful in cpuset, memcg, or mempolicy 
constrained cases as well.  We'd like to be able to say that 30% of 
available memory should be discounted from a particular task that is 
expected to use 30% more memory than others without getting preferred.  
oom_score_adj can do that, oom_adj could not.
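
The 30% discount mentioned above corresponds to an oom_score_adj of -300 on the 0..1000 scale. A minimal sketch of how such an adjustment shifts a proportional score follows. adjusted_badness() is a hypothetical name, and the clamping shown is an assumption; the kernel treats the edge cases differently (for example, -1000 disables killing outright).

```c
#include <assert.h>

/*
 * Hypothetical helper: add oom_score_adj (in thousandths of available
 * memory) to a proportional score and clamp the result to 0..1000.
 */
static int adjusted_badness(int points, int score_adj)
{
    assert(points >= 0 && points <= 1000);
    assert(score_adj >= -1000 && score_adj <= 1000);

    points += score_adj;
    if (points < 0)
        return 0;
    if (points > 1000)
        return 1000;
    return points;
}
```

A task using 45% of memory with oom_score_adj=-300 scores 150, so an unadjusted task using 20% (score 200) is selected first. The discounted task becomes the preferred victim only when it exceeds its expected usage by more than the other tasks' footprint.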
--

From: Mandeep Singh Baines
Date: Tuesday, November 16, 2010 - 5:48 pm

Here's a patch I've been working on to control thrashing.

http://lkml.org/lkml/2010/10/28/289

It works well for our app: web browser. We'd rather OOM quickly and kill
a browser tab than thrash for a few minutes and then OOM. It works well for
--
