Linus,

Please apply this. This patch reverts the oom changes made since v2.6.35.

Briefly: "oom: badness heuristic rewrite" was merged by mistake. It has not passed our design review or code review, and multiple bug reports have popped up since. I believe every patch should pass a use-case review and a code review :-/

The problem is that DavidR's patches don't reflect real-world use cases at all and break them. He can say that userland is wrong, but such an excuse doesn't solve the real-world issue. It makes no sense. I hope every developer keeps developing honestly. Googlers are NOT an exception.

David, at least the rss-based oom score passed our design review. So if you resubmit that part, we will ack it. Please remember that. Also, I can accept the oom_score_adj feature if you can remove the incompatibility issue. OK?

Linus, if you want to check the patch, please use the following:

% git diff a63d83f427fbce97a6cea0db2e64b0eb8435cd10^ mm/oom_kill.c include/linux/oom.h fs/proc/base.c

Thanks.

--------------------------------------------------------------------------
Subject: [PATCH] Revert oom rewrite series

This reverts the following commits. They broke an ABI and caused multiple end-user complaints.

9c28ab662a8e3d19d07077ac0a8931c015e8afec Revert "oom: badness heuristic rewrite"
74cd8c6cb3e093c4d67ac3eb3581e246e4981dad Revert "oom: deprecate oom_adj tunable"
79a0bd5796e754c4b4e22071c4edddef3517d010 Revert "memcg: use find_lock_task_mm() in memory cgroups oom"
a465ef80c2a9fe73c85029fcea5c68ffee8dbb69 Revert "oom: always return a badness score of non-zero for eligible tas
516fcbb0c45d943df1b739d3be3d417aee2275f3 Revert "oom: filter unkillable tasks from tasklist dump"
b1c98f95a7954c450dadd809280f86863ea9d05d Revert "oom: add per-mm oom disable count"
fd79f3f47c82a0af5288afe7556905dd171bfc43 Revert "oom: avoid killing a task if a thread sharing its mm cannot be
2d72175528870dcef577db4a2a0b49d819c6eaff Revert "oom: kill all threads sharing oom killed task's ...
I'm not getting involved in this whole flame-war. You need to convince
Andrew, who has been the person everything went through.
Linus
--
I wonder why he is keeping such a deep silence. But _I_ strongly don't want to ignore bug reports and userland complaints. I hope to fix any bug as far as my development time allows. --
Nothing to say, really. Seems each time we're told about a bug or a regression, David either fixes the bug or points out why it wasn't a bug or why it wasn't a regression or how it was a deliberate behaviour change for the better. I just haven't seen any solid reason to be concerned about the state of the current oom-killer, sorry. I'm concerned that you're concerned! A lot. When someone such as yourself is unhappy with part of MM then I sit up and pay attention. But after all this time I simply don't understand the technical issues which you're seeing here. --
>Nothing to say, really. Seems each time we're told about a bug or a
>regression, David either fixes the bug or points out why it wasn't a
>bug or why it wasn't a regression or how it was a deliberate behaviour
>change for the better.
>I just haven't seen any solid reason to be concerned about the state of
>the current oom-killer, sorry.
>I'm concerned that you're concerned! A lot. When someone such as
>yourself is unhappy with part of MM then I sit up and pay attention.
>But after all this time I simply don't understand the technical issues
>which you're seeing here.

We are just talking about oom-killer technical issues. I doubt the value of a new rewrite when the author cannot provide any evidence or experimental results. Why did you do it? What is the prominent change in your new algorithm?

As KOSAKI Motohiro said, "you removed CAP_SYS_RESOURCE condition with ZERO explanation". David just said to use the userspace tunable oom_score_adj for protection. But may I ask some questions:

1. What is the innovation in your new algorithm? The old one offered the same kind of user tunable, oom_adj.

2. If a server such as a db-server or financial-server has hugely important processes (such as root/hardware access processes) that want to be protected, you leave it to the administrator to find out which processes should be protected. You will drive the financial-server administrator crazy and lose so much money!! ^~^

3. I saw your email on LKML, where you just said "I have repeatedly said that the oom killer no longer kills KDE when run on my desktop in the presence of a memory hogging task that was written specifically to oom the machine." http://thread.gmane.org/gmane.linux.kernel.mm/48998
So you just tested your new oom_killer algorithm on your desktop with KDE. Have you provided the details of how you did the test? Can anyone repeat the experiment and get the same result as in your comment?

as KOSAKI Motohiro said, in reality word, it we makes 5-6 brain simulation, embedded, ...
Of course, I denied that. He seems to think the number of emails is more meaningful than what is said in them, but that is incorrect and makes no sense.

Why not? Also, he has to argue logically. "Hey, I think it's not a bug" makes no sense. Such a claim doesn't solve anything; userland is still unhappy.

Why not? I want quick action.

I would like to suggest that they join and contribute to a distro kernel maintenance team. Many community-based distributions welcome developers. And bugfix work teaches them a lot of things: which use cases are frequently used, which bug reports are frequently raised, etc.

That said, if anyone wants to change a userland ABI, be careful. They have to investigate userland use cases carefully and carefully avoid breaking them. If someone thinks "hey, it's no big matter, a userland rewrite can solve the issue", I strongly disagree. they don't understand why all of userland

You should have read the patch descriptions and e-mails I sent.

1) About two months ago, Dave Hansen observed a strange OOM issue, because he has a big machine and none of its processes are very big. Thus eventually every process got oom-score=0 and the oom-killer didn't work.

https://kerneltrap.org/mailarchive/linux-driver-devel/2010/9/9/6886383

DavidR changed the oom-score to +1 in that situation.

http://kerneltrap.org/mailarchive/linux-kernel/2010/9/9/4617455

But that is completely bogus. If all processes have score=1, the oom-killer falls back to being a purely random killer. I expected this and explained that his patch had this problem half a year ago, but he hasn't fixed it yet.

2) Also half a year ago, I explained that oom_adj is used by multiple applications, and that we can't break them. But DavidR didn't fix that.

3) Also about four months ago, kamezawa-san and I pointed out that his patch doesn't work on memcg. That also hasn't been fixed.

On the other hand, you can't explain what worth the OOM-rewrite patch has, because there is none. It is only "powerful"(TM) for Google. but instead It has zero ...
If there are pending complaints or bugs that I haven't addressed, please bring them to my attention. To date, I know of no issues that have been raised that I have not addressed; you're always free to disagree with my position, but in the end you may find that when the kernel moves in a

You may remember that the initial version of my rewrite replaced oom_adj entirely with the new oom_score_adj semantics. Others suggested that it be separated into a new tunable and the old tunable deprecated for a lengthy period of time. I accepted that criticism and understood the drawbacks of replacing the tunable immediately and followed those suggestions. I disagree with you that the deprecation of oom_adj for a period of two years is as dramatic as you imply and I disagree that users are experiencing problems with the linear scale that it now operates on

The resolution with which the oom killer considers memory is at 0.1% of system RAM at its highest (smaller when you have a memory controller, cpuset, or mempolicy constrained oom). It considers a task within 0.1% of memory of another task to have equal "badness" to kill; we don't break ties in between that resolution -- it all depends on which one shows up in the tasklist first. If you disagree with that resolution, which I support as being high enough, then you may certainly propose a patch to make it even finer at 0.01%, 0.001%, etc. It would only change oom_badness() to

And we didn't. oom_adj is still there and maps linearly to oom_score_adj; you just can't show a single application where that mapping breaks because it was based on an actual calculation. If you would like to cite these "multiple" applications that need to be converted to use oom_score_adj (I know of udev), please let me know and if they're open-source applications then I will commit to submitting patches for them myself. I believe the two year window is sufficient for

Please see my reply to Figo.zhang where I enumerate the four ...
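To make the resolution argument concrete, here is a minimal Python sketch. It assumes the score is simply a task's memory as a share of total memory on a 0..1000 scale (one point per 0.1%); `proportional_score` and the page counts are hypothetical names and figures chosen for illustration, not kernel code.

```python
# Sketch only: models a badness score with 0.1% resolution, as described
# in the mail above. Names and numbers are illustrative assumptions.

SCALE = 1000  # one point corresponds to 0.1% of available memory


def proportional_score(task_pages, total_pages):
    """Return the task's share of memory in units of 0.1% (integer)."""
    return task_pages * SCALE // total_pages


# 4 GiB machine with 4 KiB pages (assumed figures).
total = 4 * 1024 * 1024 // 4            # 1,048,576 pages
a = proportional_score(100_000, total)
b = proportional_score(100_500, total)  # about 2 MB more than task A
# a == b: both tasks land in the same 0.1% bucket, so they tie.

# 64 GiB machine where a task uses only 500 pages (about 2 MB).
big_total = 64 * 1024 * 1024 // 4       # 16,777,216 pages
tiny = proportional_score(500, big_total)
# tiny == 0: the task is below the 0.1% resolution.
```

Under this model every task below 0.1% of memory scores 0, which is the large-machine situation raised earlier in the thread, and any two tasks inside the same bucket are ordered only by their position in the task list.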
On Mon, 15 Nov 2010, David Rientjes wrote:

[...]

I'm not going into the debate about whether or not deprecating one tunable for two years is sufficient or not. I'm simply going to mention one app that I know of that needs to be converted to use "oom_score_adj" on my box:

[jj@dragon ~]$ uname -a
Linux dragon 2.6.37-rc1-ARCH-00542-g0143832-dirty #1 SMP PREEMPT Mon Nov 15 22:01:52 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux
[jj@dragon ~]$ dmesg | grep oom_adj
start_kdeinit (1502): /proc/1502/oom_adj is deprecated, please use /proc/1502/oom_score_adj instead.
[jj@dragon ~]$ /usr/lib/kde4/libexec/start_kdeinit --version
Qt: 4.7.1
KDE: 4.5.3 (KDE 4.5.3)

--
Jesper Juhl <jj@chaosbits.net> http://www.chaosbits.net/
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.
--
Thanks for the report! I'll get involved with kde-devel and send a patch to remove this dependency on newer kernels to expedite the process. [ Others with reports of deprecated use of oom_adj can contact me privately and I'll find the parties of interest to avoid topics unrelated to the kernel itself on LKML. ] --
CC trimmed for sanity ...

David, another one for your collection. You asked for it :-) This is CentOS-5.5 running on top of kernel 2.6.36, likely out of initrd:

$ dmesg | grep deprecated
[ 2.430330] nash-hotplug (67): /proc/67/oom_adj is deprecated, please use /proc/67/oom_score_adj instead.

Cheers
Martin
--
On Tue, Nov 16, 2010 at 11:04 AM, Martin Knoblauch

...and another, on Fedora 14, 2.6.37-rc1-git11:

auditd (2583): /proc/2583/oom_adj is deprecated, please use /proc/2583/oom_score_adj instead.

Cheers,
--alessandro

"There's always a siren singing you to shipwreck"
(Radiohead, "There There")
--
Make that 2 common apps:

% uname -a
Linux turing-police.cc.vt.edu 2.6.37-rc1-mmotm1109 #1 SMP PREEMPT Wed Nov 10 12:30:17 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
% dmesg | grep oom
[ 89.981594] sshd (4168): /proc/4168/oom_adj is deprecated, please use /proc/4168/oom_score_adj instead.
% rpm -q openssh
openssh-5.6p1-16.fc15.x86_64

5.6p1 is the latest-n-greatest released version on www.openssh.org, so somebody probably needs to rattle their chain...
Thanks, Darren Tucker fixed this a few hours after I reported it on the openssh bugzilla, the patch is at https://bugzilla.mindrot.org/show_bug.cgi?id=1838 -- it uses oom_score_adj if it exists and then falls back to oom_adj if running on an older kernel. --
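The fallback described above (try oom_score_adj first, then oom_adj on an older kernel) can be sketched roughly as follows. This is an illustrative Python sketch, not the openssh patch itself; `write_oom_setting` and its parameters are hypothetical names, and the caller is assumed to pass a directory such as /proc/self.

```python
import os


def write_oom_setting(proc_dir, score_adj, legacy_adj):
    """Write score_adj to oom_score_adj if that file exists; otherwise
    write legacy_adj to oom_adj. Returns the name of the file written.

    proc_dir is assumed to be a per-process directory such as /proc/self.
    """
    candidates = (("oom_score_adj", score_adj), ("oom_adj", legacy_adj))
    for name, value in candidates:
        path = os.path.join(proc_dir, name)
        try:
            # O_WRONLY without O_CREAT: fails if the file does not exist,
            # which is how an older kernel without oom_score_adj shows up.
            fd = os.open(path, os.O_WRONLY)
        except FileNotFoundError:
            continue
        try:
            os.write(fd, f"{value}\n".encode())
        finally:
            os.close(fd)
        return name
    raise FileNotFoundError("neither oom_score_adj nor oom_adj is available")
```

A daemon that wants to exempt itself might call `write_oom_setting("/proc/self", -1000, -17)`, assuming it has the privilege to lower its own value; the two numbers are the "disable" values of the respective scales.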
But current openssh needs to support old kernels. This is why this kind of obsoleting doesn't work well. It's not "update your app" so much as "drop support for older stuff or start doing complicated crap dependent on version", and it's why for tiny amounts of code it is the *wrong* thing to force obsolete stuff, especially when it still doesn't seem to have been properly marked for deprecation in the first place. --
On Tue, 16 Nov 2010 11:03:10 +0000

How does one mark it appropriately? The commit 51b1bd2 (oom: deprecate oom_adj tunable, see below) added it to feature-removal-schedule.txt; a patch for Documentation/ABI has also been provided in the meantime, if I'm not mistaken.

And there is already a patch for openssh: https://bugzilla.mindrot.org/show_bug.cgi?id=1838

Regards,
Flo

commit 51b1bd2ace1595b72956224deda349efa880b693
Author: David Rientjes <rientjes@google.com>
Date: Mon Aug 9 17:19:47 2010 -0700

oom: deprecate oom_adj tunable

/proc/pid/oom_adj is now deprecated so that that it may eventually be removed. The target date for removal is August 2012.

A warning will be printed to the kernel log if a task attempts to use this interface. Future warning will be suppressed until the kernel is rebooted to prevent spamming the kernel log.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
--
Yes - so why is it spewing crap, annoying users and trying to irritate application authors? It's not 2012 yet. --
It's a WARN_ON_ONCE() so it will only spew a single line as a reminder that the application needs to be updated; would you prefer that to be suppressed until a year before removal, for example? --
[CC: list trimmed again for sanity]

Another one:

[ 34.709156] chromium-browse (1439): /proc/1480/oom_adj is deprecated, please use /proc/1480/oom_score_adj instead.

2.6.37-rc2 - archlinux - package chromium-browser-ppa from AUR
--
Aug 2012 is only 6 kernel releases or so away.... Presumably the whinging is so we start tracking down the offending userspace and getting it fixed before 2012 gets here. Sticking the warning in just one or two kernel releases before it becomes official leads to "I can't run the new kernel because my userspace isn't patched yet". We really can't win here, we don't whinge and stuff doesn't get tracked down and fixed, we do whinge and that gets people upset too.
On Mon, 15 Nov 2010 19:13:15 -0500
$ dmesg | grep deprecated
[ 1.473365] udevd (662): /proc/662/oom_adj is deprecated, please use /proc/662/oom_score_adj instead.
$ apt-cache policy udev
udev:
Installed: 151-12.3
Candidate: 151-12.3
Version table:
*** 151-12.3 0
500 http://es.archive.ubuntu.com/ubuntu/ lucid-proposed/main Packages
100 /var/lib/dpkg/status
151-12.2 0
500 http://es.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
151-12 0
500 http://es.archive.ubuntu.com/ubuntu/ lucid/main Packages
$ uname -a
Linux varda 2.6.36-00001-g90d39e9 #145 SMP PREEMPT Wed Oct 20 23:27:44 CEST 2010 x86_64 GNU/Linux
I can't understand. Why do I need to ignore userland folks? WHY? I have no reason userland complain. I tend to prefer to avoid userland

No. Think of Moore's Law. A rational value will not be able to work in the future anyway. 10 years ago, I used a desktop machine with 20M bytes of memory, and now I'm using 2GB.

If you want that, you have to change userland first, and by yourself. Don't

I'm NOT interested in *powerful* crap. Please DON'T talk about which is more powerful. --
If you'd like to suggest an increase to the upper-bound of the badness score, please do so, although I don't think we need to break ties amongst tasks that differ by at most <0.1% of the system's capacity. --
Why? No. I dislike it. I dislike a proportional score. --
There have never been any bug reports related to applications using oom_score_adj and being impacted by its linear mapping onto oom_adj's exponential scale. That's because no users prior to the rewrite were using oom_adj scores that were based on either the expected memory usage of the application or the capacity of the machine. --
Zero questions? If so, I'll resend the revert to Linus.

Actually, I don't tend to listen to the shouting. It isn't discussion; it's only crappy shouting.

Googlers have to think about why no one agrees with their claims. ZERO, even though more than 20 people have discussed this with them.

DavidR seems to keep flaming, but I don't care. He has to learn that flaming doesn't solve ANYTHING. And they have to learn the correct way to discuss, how discussion differs from shouting, and why we have to learn userland workloads and avoid any breakage.

I'm angry that Googlers frequently break the kernel and frequently ignore userland complaints. --
I did. Therefore, I will resend the patch to you.

Thanks.

--------------------------------------------------------------------------
Subject: [PATCH] Revert oom rewrite series

This reverts the following commits. They broke an ABI and caused multiple end-user complaints.

9c28ab662a8e3d19d07077ac0a8931c015e8afec Revert "oom: badness heuristic rewrite"
74cd8c6cb3e093c4d67ac3eb3581e246e4981dad Revert "oom: deprecate oom_adj tunable"
79a0bd5796e754c4b4e22071c4edddef3517d010 Revert "memcg: use find_lock_task_mm() in memory cgroups oom"
a465ef80c2a9fe73c85029fcea5c68ffee8dbb69 Revert "oom: always return a badness score of non-zero for eligible tas
516fcbb0c45d943df1b739d3be3d417aee2275f3 Revert "oom: filter unkillable tasks from tasklist dump"
b1c98f95a7954c450dadd809280f86863ea9d05d Revert "oom: add per-mm oom disable count"
fd79f3f47c82a0af5288afe7556905dd171bfc43 Revert "oom: avoid killing a task if a thread sharing its mm cannot be
2d72175528870dcef577db4a2a0b49d819c6eaff Revert "oom: kill all threads sharing oom killed task's mm"
be212960618ddcdb9526ce2cb73fd081fd3e90ea Revert "oom: rewrite error handling for oom_adj and oom_score_adj tunab
1b17c41599c594c7d11ef415a92d47c205fe89ea Revert "oom: fix locking for oom_adj and oom_score_adj"

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
Documentation/feature-removal-schedule.txt | 25 ---
Documentation/filesystems/proc.txt | 97 ++++-----
fs/exec.c | 5 -
fs/proc/base.c | 176 ++--------------
include/linux/memcontrol.h | 8 -
include/linux/mm_types.h | 2 -
include/linux/oom.h | 19 +--
include/linux/sched.h | 3 +-
kernel/exit.c | 3 -
kernel/fork.c | 16 +--
mm/memcontrol.c | 28 +---
mm/oom_kill.c | 323 ...
That's inaccurate; there haven't been multiple bug reports popping up since the rewrite. In fact, there hasn't been a single bug report. There have been two changes to the oom killer since the rewrite:

- we now kill all threads sharing the oom killed task that share the ->mm since we can't free any memory without them exiting as well, and

- we count threads that are immune from oom kill attached to an ->mm so we can avoid needlessly killing tasks that aren't immune themselves but have other threads sharing the ->mm that are.

Both of those changes were needed in the old oom killer as well; they have nothing to do with the rewrite.

Also, stating that the new heuristic doesn't address CAP_SYS_RESOURCE appropriately isn't a bug report, it's the desired behavior. I eliminated all of the arbitrary heuristics in the old heuristic that we had to remove internally as well, so that it is as predictable as possible and achieves the oom killer's sole goal: to kill the most memory-hogging task that is eligible to allow memory allocations in the current context to succeed. CAP_SYS_RESOURCE threads have full control over their oom killing priority via /proc/pid/oom_score_adj and need no consideration in the heuristic by default, since it otherwise allows for the probability that multiple tasks will need to be killed when a CAP_SYS_RESOURCE thread uses an egregious

As mentioned just a few minutes ago in another thread, there is no userspace breakage with the rewrite and you're only complaining here about the deprecation of /proc/pid/oom_adj for a period of two years. Until it's removed in 2012 or later, it maps to the linear scale that oom_score_adj uses rather than its old exponential scale that was unusable for prioritization because of (1) the extremely low resolution, and (2) the arbitrary heuristics that preceded it.

You've proposed various forms of your revert (this is the fifth one) and I've responded in a very respectful and technical way each ...
, but unless they are written in the last months and designed for linux
and if the author took some time to research each external process
invocation, they can not be aware of this possibility.
Besides that, if each process is supposed to change the default, the
If it happens to use an egregious amount of memory, it SHOULD score
1) The exponential scale did have a low resolution.
2) The heuristics were developed using much brain power and much
trial-and-error. You are going back to basics, and some people
are not convinced that this is better. I googled and I did not
find a discussion about how and why the new score was designed
this way.
looking at the output of:
cd /proc; for a in [0-9]*; do
echo `cat $a/oom_score` $a `perl -pes/'\0.*$'// < $a/cmdline`;
done|grep -v ^0|sort -n |less
, I'm not convinced, either.
PS) Mapping an exponential value to a linear score is bad. E.g. an
oom_adj of 8 should make a 1-MB-process as likely to be killed as
a 256-MB-process with oom_adj=0.
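The arithmetic behind this PS can be worked through in a small Python sketch. It assumes the old scale shifts the base points left by a positive oom_adj (right for a negative one), and that the new scale maps oom_adj linearly to oom_score_adj as oom_adj * 1000 / 17 (with 15 mapped to 1000) before adding it to a 0..1000 proportional score. Both functions are simplified models with hypothetical names, using megabytes as the unit and a 2 GB machine as an assumed example.

```python
# Sketch only: simplified models of the two oom_adj interpretations.
# Units (MB) and the 2048 MB machine are illustrative assumptions.

def old_exponential(points, oom_adj):
    """Old scale: oom_adj is a bit shift applied to the base points."""
    if oom_adj > 0:
        return max(points, 1) << oom_adj
    return points >> -oom_adj


def linear_score_adj(oom_adj):
    """Assumed linear mapping of oom_adj (-17..15) onto -1000..1000."""
    if oom_adj == 15:
        return 1000
    return int(oom_adj * 1000 / 17)  # truncates toward zero, like C


def new_linear(mem_mb, total_mb, oom_adj):
    """New scale: share of memory in 0.1% units plus the mapped adj."""
    return max(mem_mb * 1000 // total_mb + linear_score_adj(oom_adj), 0)


# Old scale: 1 MB with oom_adj=8 equals 256 MB with oom_adj=0.
small_old = old_exponential(1, 8)
large_old = old_exponential(256, 0)

# New scale on a 2048 MB machine: the same two tasks no longer tie.
small_new = new_linear(1, 2048, 8)
large_new = new_linear(256, 2048, 0)
```

In this model the two tasks are equal on the old scale, while on the linear scale the 1 MB task with oom_adj=8 scores well above the 256 MB task, because the adjustment now acts as a fixed share of memory rather than a multiplier.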
PS2) Because I saw this in your presentation PDF: (@udev-people)
The -17 score of udevd is wrong, since it will even prevent
the OOM killer from working correctly if it grows to 100 MB:
Its default OOM score is 13, while root's shell is at 190
and some KDE processes are at 200 000. It will not get killed
under normal circumstances.
If udevd grows enough to score 190 as well, it has a bug
that causes it to eat memory and it needs to be killed. Having
a -17 oom_adj, it will cause the system to fail instead.
Considering udevd's size, an adj of -1 or -2 should be enough on
embedded systems, while desktop systems should not need it.
If you are worried about udevd getting killed, protect it using
a wrapper.
--
You're clearly wrong; CAP_SYS_RESOURCE has been required to modify oom_adj for over five years (as long as the git history). 8fb4fc68, merged into 2.6.20, allowed tasks to raise their own oom_adj but not decrease it.

That doesn't make any sense. If you want to protect a thread from the oom killer you're going to need to modify oom_score_adj; the kernel can't know what you perceive as being vital. Having CAP_SYS_RESOURCE alone does not imply that, it only allows unbounded access to resources. That's completely orthogonal to the goal of the oom killer heuristic, which is to

The old heuristics were a mixture of arbitrary values that didn't adjust scores based on a unit and would often cause the incorrect task to be targeted because there was no clear goal being achieved. The new heuristic has a solid goal: to identify and kill the most memory-hogging task that is eligible given the context in which the oom occurs. If you disagree with that goal and want any of the old heuristics reintroduced,

To show that, you would have to show that an application that exists today uses an oom_adj for something other than polarization and is based on a

Threads with CAP_SYS_RESOURCE are free to lower the oom_score_adj of any thread they deem fit, and that includes applications that lower their own oom_score_adj. The kernel isn't going to prohibit users from setting their own oom_score_adj. --
You are misunderstanding me. It was allowed to do this, but it did not need to do it yet. It was enough to be a well-written POSIX application without

The old oom killer's task was to guess the best victim to kill. For me, it did a good job (but the system kept thrashing for too long until it kicked the offender). Looking at CAP_SYS_RESOURCE was one way to recognize

The first old OOM killer did the same as you promise the current one does, except for your bugfixes. That's why it killed the wrong applications and all the heuristics were added until the complaints stopped.

Of course I did not yet test your OOM killer; maybe it really is better. Heuristics tend to rot and you did much work to make it right. I don't want the old OOM killer back, but I don't want you to fall

No such application should exist because the OOM killer should DTRT. oom_adj was supposed to let the sysadmin lower his mission-critical DB's score to be just lower than the less-important tasks, or to

My point is: The udev people should not prevent the OOM killer unconditionally; it has an important task in case something goes wrong. I just didn't want to start a new thread at that time of day.

--
How do I set my laser printer on stun?
--
CAP_SYS_RESOURCE does not imply the task is important.

There's a problem when the kernel is oom; killing a thread that is getting work done is one of the most serious remedies the kernel will ever apply to allow forward progress. In almost all scenarios (except in some cpuset or memcg configurations), it's a userspace configuration issue that exhausts memory and the VM finds no other alternative. CAP_SYS_RESOURCE threads have access to unbounded amounts of resources and thus can use an extremely large amount of memory very quickly and to the detriment of other threads that may be as important or more important. Considering them any different is an unsubstantiated and undefined behavior that should not be considered in the heuristic _unless_ the administrator or the task itself

No, the old oom killer did not always kill the application that used the most memory; it considered other factors with arbitrary point deductions such as nice level, runtime, CAP_SYS_RAWIO, CAP_SYS_RESOURCE, etc. We had to remove those heuristics internally in older kernels as well because they would often allow a task to run away using a massive amount of memory because of leaks and kill everything else on the system before targeting the appropriate task. At that point, it left the system with

Thanks, and that's why I'm trying to avoid additional heuristics such as CAP_SYS_RESOURCE where the priority is _implied_ rather than _proven_. If CAP_SYS_RESOURCE was defined to be more preferred to stay alive, then I'd

oom_score_adj allows us to define when an application is using more memory than expected and is often helpful in cpuset, memcg, or mempolicy constrained cases as well. We'd like to be able to say that 30% of available memory should be discounted from a particular task that is expected to use 30% more memory than others without getting preferred. oom_score_adj can do that, oom_adj could not. --
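As a rough illustration of the "discount 30% of available memory" idea, the Python sketch below assumes scores are expressed in units of 0.1% of available memory and that oom_score_adj is simply added to that figure, so a discount of 30% corresponds to -300. The task names, the usage figures, and the `pick_victim` helper are hypothetical.

```python
# Sketch only: a 30% discount expressed as oom_score_adj = -300 on an
# assumed 0..1000 scale. Task names and numbers are made up.

def adjusted_score(mem_permille, oom_score_adj=0):
    """Memory share in 0.1% units plus the adjustment, kept in 0..1000."""
    return min(max(mem_permille + oom_score_adj, 0), 1000)


def pick_victim(tasks):
    """Return the name of the task with the highest adjusted score.

    tasks maps a name to a (mem_permille, oom_score_adj) pair.
    """
    return max(tasks, key=lambda name: adjusted_score(*tasks[name]))


# A database expected to use about 30% more memory than its neighbours.
normal = {"db": (500, -300), "worker": (250, 0)}
# The same database after growing well beyond its expected footprint.
leaking = {"db": (600, -300), "worker": (250, 0)}

victim_normal = pick_victim(normal)
victim_leaking = pick_victim(leaking)
```

In this model the database is passed over while it stays within its expected footprint (200 against the worker's 250) and becomes the preferred victim once it exceeds it (300 against 250), so the discount shifts the threshold without making the task immune.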
Here's a patch I've been working on to control thrashing. http://lkml.org/lkml/2010/10/28/289 It works well for our app: web browser. We'd rather OOM quickly and kill a browser tab than thrash for a few minutes and then OOM. It works well for --
