Re: [PATCH]oom-kill: direct hardware access processes should get bonus

From: Figo.zhang
Date: Monday, November 1, 2010 - 6:43 pm

The oom victim should not be a process that directly accesses hardware
devices, such as the Xorg server, because the hardware could be left in an
unpredictable state. Although a user application can set
/proc/pid/oom_score_adj to protect such a process, I think these processes
should get a 3% bonus for protection.

Signed-off-by: Figo.zhang <figo1802@gmail.com>
---
mm/oom_kill.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4029583..df6a9da 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -195,10 +195,12 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	task_unlock(p);
 
 	/*
-	 * Root processes get 3% bonus, just like the __vm_enough_memory()
-	 * implementation used by LSMs.
+	 * Root and direct hardware access processes get 3% bonus, just like the
+	 * __vm_enough_memory() implementation used by LSMs.
 	 */
-	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+	    has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
+	    has_capability_noaudit(p, CAP_SYS_RAWIO))
 		points -= 30;
 
 	/*


--

From: David Rientjes
Date: Monday, November 1, 2010 - 8:10 pm

Which applications are you referring to that cannot gracefully exit if 

LSM's have this bonus for CAP_SYS_ADMIN, but not for CAP_SYS_RAWIO, so 

CAP_SYS_RAWIO had a much more dramatic impact in the previous heuristic to 
such a point that it would often allow memory hogging tasks to elude the 
oom killer at the expense of innocent tasks.  I'm not sure this is the 
best way to go.
--

From: Figo.zhang
Date: Tuesday, November 2, 2010 - 7:24 am

Like the Xorg server: if the Xorg server is killed, the GNOME desktop will be

Are there any experiments demonstrating that CAP_SYS_RAWIO lets a task elude
the oom killer?




--

From: David Rientjes
Date: Tuesday, November 2, 2010 - 12:34 pm

Right, but you didn't explicitly prohibit such applications from being 
killed, so that suggests that doing so may be inconvenient but doesn't 
incur something like corruption or data loss, which is what I would 
consider "unstable" or "inconsistent" state.

We're trying to avoid any additional heuristics from being introduced for 
specific usecases, even for Xorg.  That ensures that the heuristic remains 
as predictable as possible and frees a large amount of memory.  If Xorg is 
being killed first instead of a true memory hogger, then it seems like a 
forkbomb scenario instead; could you please post your kernel log so that 

The old heuristic would allow it to elude the oom killer because it would 
divide the score by four if a task had the capability, which is a much 
more drastic "bonus" than you suggest here.  That would reduce the score 
for the memory hogging task significantly enough that we killed tons of 
innocent tasks instead before eventually killing the task that was leaking 
memory but failed to be identified because it had CAP_SYS_RAWIO.  I'm 
trying to avoid any such repeats.
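
The effect described above can be illustrated with a minimal sketch in C. It assumes a simplified model in which a task's score is just its memory share in units of 0.1%, as in the rewritten heuristic. The function names and example numbers are hypothetical, and old_style_score() models only the divide-by-four step, not the rest of the old badness() calculation.

```c
#include <stdbool.h>

/* Hypothetical model: a score is the task's memory share in units of 0.1%. */

/* Models only the old "divide by four" capability bonus. */
static int old_style_score(int mem_permille, bool has_rawio)
{
	return has_rawio ? mem_permille / 4 : mem_permille;
}

/* Models the flat 3% (30 point) bonus proposed in this thread. */
static int proposed_score(int mem_permille, bool has_rawio)
{
	int points = has_rawio ? mem_permille - 30 : mem_permille;

	return points < 0 ? 0 : points;
}

/* True if the privileged hog outranks the innocent task and is picked. */
static bool hog_is_victim(int hog_score, int innocent_score)
{
	return hog_score > innocent_score;
}
```

In this model, a CAP_SYS_RAWIO task using 60% of memory scores 600 / 4 = 150 under the divide-by-four rule, which is below an unprivileged task using 20% (score 200), so the unprivileged task is chosen. With a flat 30 point bonus the same hog scores 570 and is still chosen first.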
--

From: Figo.zhang
Date: Wednesday, November 3, 2010 - 4:43 pm

CAP_SYS_RESOURCE processes should also get the 3% bonus for protection.

Signed-off-by: Figo.zhang <figo1802@gmail.com>
--- 
mm/oom_kill.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4029583..30b24b9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -198,7 +198,8 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	 * Root processes get 3% bonus, just like the __vm_enough_memory()
 	 * implementation used by LSMs.
 	 */
-	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
 		points -= 30;
 
 	/*


--

From: David Rientjes
Date: Wednesday, November 3, 2010 - 4:47 pm

--

From: Figo.zhang
Date: Wednesday, November 3, 2010 - 6:38 pm

A process with the CAP_SYS_RESOURCE capability is one which has system
resource limits, like the journaling resource on an ext3/4 filesystem or the
RTC clock, so it should get the same treatment as a process with CAP_SYS_ADMIN.

Best,

Figo.zhang



--

From: David Rientjes
Date: Wednesday, November 3, 2010 - 6:50 pm

NACK, there's no justification that these tasks should be given a 3% 
memory bonus in the oom killer heuristic; in fact, since they can allocate 
without limits it is more important to target these tasks if they are 
using an egregious amount of memory.  CAP_SYS_RESOURCE threads have the 
ability to lower their own oom_score_adj values, thus, they should protect 
themselves if necessary like everything else.
--

From: Figo.zhang
Date: Wednesday, November 3, 2010 - 7:12 pm

In your new heuristic, CAP_SYS_RESOURCE is also used for protection.
See fs/proc/base.c, line 1167:
	if (oom_score_adj < task->signal->oom_score_adj &&
			!capable(CAP_SYS_RESOURCE)) {
		err = -EACCES;
		goto err_sighand;
	}

So suppose I want to protect a normal process that does not have
CAP_SYS_RESOURCE by setting a smaller oom_score_adj. If the new oom_score_adj
is smaller than the current one and the process is subject to resource limits,
the value will not be adjusted. That does not seem right?





--

From: David Rientjes
Date: Wednesday, November 3, 2010 - 7:54 pm

Tasks without CAP_SYS_RESOURCE cannot lower their own oom_score_adj, 
otherwise it can trivially kill other tasks.  They can, however, increase 
their own oom_score_adj so the oom killer prefers to kill it first.
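
As a rough illustration of this rule, the check quoted earlier from fs/proc/base.c can be modelled as a stand-alone function. This is only a sketch: the function name and its boolean parameter are hypothetical and stand in for the kernel's capable(CAP_SYS_RESOURCE) call.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical stand-alone model of the quoted check: lowering
 * oom_score_adj needs CAP_SYS_RESOURCE, raising it does not.
 * Returns 0 if the write would be allowed, -EACCES otherwise.
 */
static int oom_score_adj_write_allowed(int current_adj, int new_adj,
				       bool has_cap_sys_resource)
{
	if (new_adj < current_adj && !has_cap_sys_resource)
		return -EACCES;
	return 0;
}
```

Under this model an unprivileged task at 0 may move itself to +500 but not to -500, while a task holding CAP_SYS_RESOURCE may move in either direction.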

I think you may be confused: CAP_SYS_RESOURCE overrides resource limits.
--

From: Figo.zhang
Date: Wednesday, November 3, 2010 - 9:42 pm

CAP_SYS_RESOURCE == 1 means no resource limits, just like a
superuser,
CAP_SYS_RESOURCE == 0 means the process is held to resource limits, like a
normal user, right?

A new, lower oom_score_adj will protect the process, right?

A task without CAP_SYS_RESOURCE is not a superuser, so why
can't the user protect it with oom_score_adj?

For example, I want to protect my program gnome-terminal, which is
without CAP_SYS_RESOURCE (it has resource limits):

[figo@myhost ~]$ ps -ax | grep gnome-ter
Warning: bad ps syntax, perhaps a bogus '-'? See
http://procps.sf.net/faq.html
 2280 ?        Sl     0:01 gnome-terminal
 8839 pts/0    S+     0:00 grep gnome-ter
[figo@myhost ~]$ cat /proc/2280/oom_adj 
3
[figo@myhost ~]$ echo -17 >  /proc/2280/oom_adj 
bash: echo: write error: Permission denied
[figo@myhost ~]$ 



--

From: David Rientjes
Date: Wednesday, November 3, 2010 - 10:08 pm

Because, as I said, it would be trivial for a user program to deplete all 
memory (either intentionally or unintentionally) and cause every other task
on the system to be oom killed as a result.  That's an undesired result of 

If this is your system, you can either give yourself CAP_SYS_RESOURCE or 
do it through the superuser.  This isn't exactly new, it's been the case 
for the past four years.

I'm still struggling to find out the problem that you're trying to address 
with your various patches, perhaps because you haven't said what it is.
--

From: KOSAKI Motohiro
Date: Tuesday, November 9, 2010 - 4:01 am

David, YOU are the stupid one. you removed CAP_SYS_RESOURCE condition with ZERO
explanation, and Figo reported a regression. That is reason enough to
undo it. YOU have the duty to explain why you want the change and why
you think it is justified.

Don't blame the bug reporter. That's completely wrong.




--

From: Alan Cox
Date: Tuesday, November 9, 2010 - 5:24 am

Can people stop throwing things at each other and worry about the facts

- If it's a regression it should get reverted or fixed. But is it
  actually a regression ? Has the underlying behaviour changed in a
  problematic way?

"CAP_SYS_RESOURCE threads have the ability to lower their own oom_score_adj
 values, thus, they should protect themselves if necessary like
 everything else."

The reverse can be argued equally - that they can unprotect themselves if
necessary. In fact it seems to be a "point of view" sort of question
which way you deal with CAP_SYS_RESOURCE, and that to me argues that
changing from old expected behaviour to a new behaviour is a regression.



--

From: David Rientjes
Date: Tuesday, November 9, 2010 - 2:06 pm

I didn't check earlier, but CAP_SYS_RESOURCE hasn't had a place in the oom 
killer's heuristic in over five years, so what regression are we referring 
to in this thread?  These tasks already have full control over 
oom_score_adj to modify its oom killing priority in either direction.

And, as I said, giving these threads a bonus to be less preferred doesn't 
seem appropriate since (1) it's not a defined or expected behavior of 
CAP_SYS_RESOURCE like it is for sysadmin tasks, and (2) these threads are 
not bound by resource limits and thus have a higher likelihood of consuming
larger amounts of memory.

That's why I nack'd the patch in the first place and still do, there's no 
regression here and it's not in the best interest of freeing a large 
amount of memory which is the sole purpose of the oom killer.

Furthermore, the heuristic was entirely rewritten, but I wouldn't consider
all the old factors such as cputime and nice level being removed as 
"regressions" since the aim was to make it more predictable and more 
likely to kill a large consumer of memory such that we don't have to kill 
more tasks in the near future.
--

From: David Rientjes
Date: Tuesday, November 9, 2010 - 2:25 pm

Yes, CAP_SYS_RESOURCE was a part of the heuristic in 2.6.25 along with 
CAP_SYS_ADMIN and was removed with the rewrite; when I said it "hasn't had 
a place in the oom killer's heuristic," I meant it's an unnecessary 
extension to CAP_SYS_ADMIN and allows for killing innocent tasks when a
CAP_SYS_RESOURCE task is using too much memory.

The fundamental issue here is whether or not we should give a bonus to 
CAP_SYS_RESOURCE tasks because they are, by definition, allowed to access 
extra resources and we're willing to sacrifice other tasks for that.  This 
is antagonistic to the oom killer's sole goal, however, which is to kill the
task consuming the largest amount of memory unless protected by userspace
(which CAP_SYS_RESOURCE has complete control in doing).

Since these threads have complete ability to give themselves this bonus 
(echo -30 > /proc/self/oom_score_adj), I don't think this needs to be a 
part of the core heuristic nor with such an arbitrary value of 3% (the old 
heuristic divided its badness score by 4, another arbitrary value).
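
For a program that prefers to do this itself rather than through the shell, a minimal sketch in C follows. The helper name write_oom_score_adj() is hypothetical. The path is a parameter so the helper can be tried against an ordinary file; in real use it would be "/proc/self/oom_score_adj", and lowering the value is assumed to require CAP_SYS_RESOURCE as discussed in this thread.

```c
#include <stdio.h>

/*
 * Write an adjustment value to an oom_score_adj style file.
 * Returns 0 on success, -1 on failure (for example, permission denied).
 */
static int write_oom_score_adj(const char *path, int adj)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", adj) > 0;
	/* A refused write may only be reported once the buffer is flushed. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A call such as write_oom_score_adj("/proc/self/oom_score_adj", -30) would then correspond to the echo command above.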
--

From: Figo.zhang
Date: Wednesday, November 10, 2010 - 7:38 am

Yes, it can be controlled by the user, but will all system administrators
really adjust the processes one by one in the real world? suppose if

The goal of the oom killer is to find the best process to kill. That
should be a process which:
1. is the most memory-consuming of all processes, and
2. is a proper process to kill, one whose death will not leave the system
in an unpredictable state, as far as possible.

If a user process such as the email client "evolution" and a process with
direct hardware access such as "Xorg" have eaten equal amounts of memory,
which process do you want to kill?


--

From: David Rientjes
Date: Wednesday, November 10, 2010 - 1:50 pm

Yes, the kernel can't possibly know the oom killing priorities of your 
task so if you have such requirements then you must use the userspace 

There are four types of tasks that are improper to kill and this is 
relatively unchanged in the past five years of the oom killer:

 - init,

 - kthreads,

 - tasks that are bound to a disjoint set of cpuset mems or mempolicy 
   nodes that are not oom, and

 - those disabled from oom killing by userspace.
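
The four categories above can be sketched as a single eligibility test. This is an illustration only: struct task_view and its fields are hypothetical and do not correspond to the kernel's task_struct, and the value -1000 for OOM_SCORE_ADJ_MIN is taken from the oom_score_adj range mentioned elsewhere in this thread.

```c
#include <stdbool.h>

#define OOM_SCORE_ADJ_MIN (-1000) /* value that disables oom killing */

/* Hypothetical, flattened view of a task; not the kernel's task_struct. */
struct task_view {
	bool is_init;          /* the init process */
	bool is_kthread;       /* a kernel thread */
	bool shares_oom_nodes; /* overlaps the cpuset mems or mempolicy nodes that are oom */
	int oom_score_adj;     /* userspace tunable */
};

/* Returns true if the task may be considered as an oom victim at all. */
static bool oom_eligible(const struct task_view *t)
{
	if (t->is_init || t->is_kthread)
		return false;
	if (!t->shares_oom_nodes)
		return false;
	if (t->oom_score_adj == OOM_SCORE_ADJ_MIN)
		return false;
	return true;
}
```

Note that no capability appears in this test, which is the point being made: capabilities only shift the score of an eligible task, they never remove it from consideration.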

That does not include CAP_SYS_RESOURCE, nor CAP_SYS_ADMIN.  Your argument 
about killing some tasks that have CAP_SYS_RESOURCE leaving hardware in an 
unpredictable state isn't even addressed by your own patch, you only give 
them a 3% memory bonus so they are still eligible.

As mentioned previously, for this patch to make sense, you would need to 
show that CAP_SYS_RESOURCE equates to 3% of the available memory's 
capacity for a task.  I don't believe that evidence has been presented.  
This has nothing to do with preventing these threads from being killed (at 
the risk of possibly panicking the machine) since your patch doesn't do 

Both have equal oom killing priority according to the heuristic if they 
are not run by root.  If you would like to protect Xorg, then you need to 
use the userspace tunable to protect it just like everything else does.  
This is completely unchanged from the oom killer rewrite.

If you actually have a problem that you're reporting, however, it would 
probably be better to show the oom killer log from that event and let us 
address it instead of introducing arbitrary heuristics into something 
which aims to be as predictable as possible.
--

From: KOSAKI Motohiro
Date: Tuesday, November 9, 2010 - 3:41 am

I was surprised this issue is still there. This was pointed out half year 


But yes, the OOM killer needs to take care of both CAP_SYS_RESOURCE and CAP_SYS_RAWIO.

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




--

From: Figo.zhang
Date: Tuesday, November 9, 2010 - 5:24 am

The oom victim should not be a process that directly accesses hardware
devices, such as the Xorg server, because the hardware could be left in an
unpredictable state. Although a user application can set
/proc/pid/oom_score_adj to protect such a process, I think these processes
should get a 3% bonus for protection.

In v2, fix the incorrect comment.

Signed-off-by: Figo.zhang <figo1802@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/oom_kill.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4029583..9b06f56 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -196,9 +196,12 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 
 	/*
 	 * Root processes get 3% bonus, just like the __vm_enough_memory()
-	 * implementation used by LSMs.
+	 * implementation used by LSMs. And direct hardware access processes
+	 * also get 3% bonus.
 	 */
-	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+	    has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
+	    has_capability_noaudit(p, CAP_SYS_RAWIO))
 		points -= 30;
 
 	/*


--

From: David Rientjes
Date: Tuesday, November 9, 2010 - 2:16 pm

The logic here is wrong: if killing these tasks can leave hardware in an 
unpredictable state (and that state is presumably harmful), then they 
should be completely immune from oom killing since you're still leaving 
them exposed here to be killed.

So the question that needs to be answered is: why do these threads deserve 
to use 3% more memory (not >4%) than others without getting killed?  If 
there was some evidence that these threads have a certain quantity of 
memory they require as a fundamental attribute of CAP_SYS_RAWIO, then I 
have no objection, but that's going to be expressed in a memory quantity 
not a percentage as you have here.

The CAP_SYS_ADMIN heuristic has a background: it is used in the oom killer 
because we have used the same 3% in __vm_enough_memory() for a long time 
and we want consistency amongst the heuristics.  Adding additional bonuses 
with arbitrary values like 3% of memory for things like CAP_SYS_RAWIO 
makes the heuristic less predictable and moves us back toward the old 
heuristic which was almost entirely arbitrary.

Now before KOSAKI-san comes out and says the old heuristic considered 
CAP_SYS_RAWIO and the new one does not so it _must_ be a regression: the 
old heuristic also divided the badness score by 4 for that capability as a 
completely arbitrary value (just like 3% is here).  Other traits like 
runtime and nice levels were also removed from the heuristic.  What needs 
to be shown is that CAP_SYS_RAWIO requires additional memory just to run 
or we should neglect to free 3% of memory, which could be gigabytes, 
because it has this trait.
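
To put a number on "could be gigabytes", here is a small arithmetic sketch. It assumes one badness point equals 0.1% of the memory under consideration, so the 30 point bonus is 3%. The helper names and the 128 GiB machine size are hypothetical examples.

```c
#include <stdint.h>

/* One badness point is taken to be 0.1% of the memory being considered. */
static uint64_t bytes_per_point(uint64_t total_bytes)
{
	return total_bytes / 1000;
}

/* Memory that a bonus of 'points' lets a task use without outranking others. */
static uint64_t bonus_bytes(uint64_t total_bytes, unsigned int points)
{
	return bytes_per_point(total_bytes) * points;
}
```

For a machine with 128 GiB of memory (137438953472 bytes), bonus_bytes() with 30 points gives 4123168590 bytes, roughly 3.84 GiB.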
--

From: Figo.zhang
Date: Wednesday, November 10, 2010 - 7:48 am

we let the processes with hardware access get bonus for protection. the


Yes, I think it may be better for the protected processes to have their
badness score divided by 4, like in the old heuristic.




--

From: KOSAKI Motohiro
Date: Saturday, November 13, 2010 - 10:07 pm

That's bogus. __vm_enough_memory() tracks virtual address space. oom-killer

The old background is very simple and cleaner.

CAP_SYS_RESOURCE means the process has the privilege of using more resources,
so the oom-killer gave it an additional bonus.

CAP_SYS_RAWIO means the process has a direct hardware access privilege
(e.g. X.org, RDB), so killing it might make the system crash.


On another note, somebody doubts whether the 4x bonus is good or not, but 3%
has the same problem.




--

From: David Rientjes
Date: Sunday, November 14, 2010 - 2:29 pm

No, 3% was chosen in __vm_enough_memory() for LSMs as the comment in the 
oom killer shows:

        /*
         * Root processes get 3% bonus, just like the __vm_enough_memory()
         * implementation used by LSMs.
         */

and is described in Documentation/filesystems/proc.txt.

I think in cases of heuristics like this where we obviously want to give 
some bonus to CAP_SYS_ADMIN that there is consistency with other bonuses 

The old heuristic divided the arbitrary badness score by 4 with 
CAP_SYS_RESOURCE.  The new heuristic doesn't consider it.


As a side-effect of being given more resources to allocate, those 
applications are relatively unbounded in terms of memory consumption to 
other tasks.  Thus, it's possible that these applications are using a 
massive amount of memory (say, 75%) and now with the proposed change a 
task using 25% of memory would be killed instead.  This increases the 
likelihood that the CAP_SYS_RESOURCE thread will have to be killed
eventually, anyway, and the goal is to kill as few tasks as possible to 
free sufficient amount of memory.

Since threads having CAP_SYS_RESOURCE have full control over their 
oom_score_adj, they can take the additional precautions to protect 
themselves if necessary.  It doesn't need to be a part of the heuristic to 
bias these tasks which will lead to the undesired result described above 

Then you would want to explicitly filter these tasks from oom kill just as 
OOM_SCORE_ADJ_MIN works rather than giving them a memory quantity bonus.
--

From: KOSAKI Motohiro
Date: Sunday, November 14, 2010 - 6:24 pm

Keep the comparison apples to apples. vm_enough_memory() accounts _virtual_ memory.

You are talking about two different things at once: 3% vs. 4x, and
CAP_SYS_RESOURCE vs. CAP_SYS_ADMIN.


No. Why should userland have to recover from your mistake?




--

From: David Rientjes
Date: Monday, November 15, 2010 - 3:03 am

It's not unrelated, the LSM function gives an arbitrary 3% bonus to 
CAP_SYS_ADMIN.  Such threads should also be preferred in the oom killer 
over other threads since they tend to be more important but not an overly 
drastic bias such that they don't get killed when using an egregious 
amount of memory.  So in selecting a small percentage of memory that tends 
to be a significant bias but not overwhelming, I went with the 3% found 
elsewhere in the kernel.  __vm_enough_memory() doesn't have that 
preference for any scientifically calculated reason, it's a heuristic just 

You just said killing any CAP_SYS_RAWIO task may make the system crash, so 
presuming that you don't want the system to crash, you are suggesting we 
should make these threads completely immune?  That's never been the case 
(and isn't for oom_kill_allocating_task, either), so there's no history 
you can draw from to support your argument.
--

From: KOSAKI Motohiro
Date: Tuesday, November 23, 2010 - 12:16 am

__vm_enough_memory() only guards against memory overcommitting, and it doesn't
have any way to recover. We expect the admin to recover by HAND. On the
other hand, the oom-killer _is_ an automatic way to recover. It's no need admin's

No. I only require that YOU investigate userland use cases BEFORE making
a change.



--

From: David Rientjes
Date: Saturday, November 27, 2010 - 6:36 pm

I needed a small bias for CAP_SYS_ADMIN tasks so I chose 3% since it's the 
same proportion used elsewhere in the kernel and works nicely since the 
badness score is now a proportion.  If you'd like to propose a different 
percentage or suggest removing the bias for root tasks altogether, feel 
free to propose a patch.  Thanks!
--

From: KOSAKI Motohiro
Date: Tuesday, November 30, 2010 - 6:00 am

I only need to revert the bad change.


Thanks.



--

From: David Rientjes
Date: Tuesday, November 30, 2010 - 1:05 pm

We have always preferred to break ties between applications by not 
preferring the root task over the user task in the oom killer.  If you'd 
like to remove this bonus for CAP_SYS_ADMIN, please propose a patch.  
Thanks!
--

From: Figo.zhang
Date: Wednesday, November 10, 2010 - 8:14 am

The oom victim should not be a process that directly accesses hardware
devices, such as the Xorg server, because the hardware could be left in an
unpredictable state. Although a user application can set
/proc/pid/oom_score_adj to protect such a process, I think these processes
should get a bonus for protection.

In v2, fix the incorrect comment.
In v3, change to dividing the badness score by 4, like the old heuristic, for
protection. We just want the oom killer to avoid selecting Root/RESOURCE/RAWIO
processes as far as possible.

Suppose a user process A, such as the email client "evolution", and a process B
with direct hardware access, such as "Xorg", have eaten equal amounts of memory
(the badness score is the same). Which process do you want to kill? The new
heuristic will kill process B, but in reality we want to kill process A.

Signed-off-by: Figo.zhang <figo1802@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
mm/oom_kill.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4029583..f43d759 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -202,6 +202,15 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 		points -= 30;
 
 	/*
+	 * Root and direct hareware access processor are usually more 
+	 * important, so them should get bonus for protection. 
+	 */
+	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+	    has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
+	    has_capability_noaudit(p, CAP_SYS_RAWIO))
+		points /= 4;
+
+	/*
 	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
 	 * either completely disable oom killing or always prefer a certain
 	 * task.


--

From: Figo.zhang
Date: Wednesday, November 10, 2010 - 8:24 am

The oom victim should not be a process that directly accesses hardware
devices, such as the Xorg server, because the hardware could be left in an
unpredictable state. Although a user application can set
/proc/pid/oom_score_adj to protect such a process, I think these processes
should get a bonus for protection.

In v2, fix the incorrect comment.
In v3, change to dividing the badness score by 4, like the old heuristic, for
protection. We just want the oom killer to avoid selecting Root/RESOURCE/RAWIO
processes as far as possible.

Suppose a user process A, such as the email client "evolution", and a process B
with direct hardware access, such as "Xorg", have eaten equal amounts of memory
(the badness score is the same). Which process do you want to kill? The new
heuristic will kill process B, but in reality we want to kill process A.

Signed-off-by: Figo.zhang <figo1802@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/oom_kill.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4029583..f43d759 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -202,6 +202,15 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 		points -= 30;
 
 	/*
+	 * Root and direct hareware access processes are usually more 
+	 * important, so they should get bonus for protection. 
+	 */
+	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+	    has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
+	    has_capability_noaudit(p, CAP_SYS_RAWIO))
+		points /= 4;
+
+	/*
 	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
 	 * either completely disable oom killing or always prefer a certain
 	 * task.


--

From: David Rientjes
Date: Wednesday, November 10, 2010 - 2:00 pm

Again, this argument doesn't work: if killing the task leaves hardware in 
an unpredictable state (and that's presumably harmful), then they 
shouldn't be killed at all.

Please show why CAP_SYS_RESOURCE equates to 3% additional memory for such 
tasks.

CAP_SYS_RESOURCE allows those threads to override resource limits, so 
these have potentially unbounded amounts of memory usage.  Thus, they may 
have the highest memory usage on the machine and now your patch has caused 
other innocent tasks to be killed before this is actually targeted.  
That's a bad result.  Why do we need this type of hack in the oom killer 
when these threads have the privilege to modify oom killing priorities for 
all tasks on the system?  Laziness, at the cost of a less predictable 
heuristic?


Then you need to protect process B accordingly and since it has 
CAP_SYS_RESOURCE it can easily do that on its own or the admin can protect 

Unless you did this in private, I didn't see KOSAKI-san's reviewed-by line 
for this change and it is drastically different from what you've proposed 

What on earth?  So now CAP_SYS_ADMIN gets a 3% bonus in the if-clause 
above this, then we divide a percentage of memory use by 4?  What does 
that mean AT ALL?
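
The stacking being questioned here can be made concrete with a small sketch of the two clauses as they would run with the v3 patch applied. This is a hypothetical model: it takes the score as a plain int in units of 0.1% of memory, replaces the has_capability_noaudit() calls with boolean parameters, and leaves out the oom_score_adj step that follows in oom_badness().

```c
#include <stdbool.h>

/*
 * Hypothetical model of the two capability clauses with v3 applied.
 * 'points' is the task's memory share in units of 0.1%.
 */
static int v3_points(int points, bool cap_sys_admin, bool cap_sys_resource,
		     bool cap_sys_rawio)
{
	/* Existing clause: root processes get the 3% bonus. */
	if (cap_sys_admin)
		points -= 30;

	/* Clause added by v3: divide what is left by four. */
	if (cap_sys_admin || cap_sys_resource || cap_sys_rawio)
		points /= 4;

	return points < 0 ? 0 : points;
}
```

In this model a CAP_SYS_ADMIN task using 75% of memory ends up at (750 - 30) / 4 = 180, below an unprivileged task using 25% (score 250), so the smaller task would be chosen first.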

And now you've thrown CAP_SYS_RAWIO in there without any mention in the 
changelog?

Are you just trying to introduce all the old arbitrary heuristics from 
before the rewrite back into the oom killer like this?

Do you actually have a log from an event where the oom killer targeted the 
incorrect task?
--

From: KOSAKI Motohiro
Date: Saturday, November 13, 2010 - 10:21 pm

Sorry for the delay. I've sent a complete revert patch to Linus. It will
make your headache disappear, I believe. I'm sorry that our development
caused you harm. We really didn't want that.




--

From: David Rientjes
Date: Sunday, November 14, 2010 - 2:33 pm

Oh please, your dramatics are getting better and better.

Figo.zhang never described a problem that was being addressed but rather 
proposed several different variants of a patch (some with CAP_SYS_ADMIN, 
some with CAP_SYS_RESOURCE, some with CAP_SYS_RAWIO, some with a 
combination, some with a 3% bonus, some with an order-of-2 bonus, etc) to
return the same heuristic used in the old oom killer.  I asked several 
times to show the oom killer log from the problematic behavior and none 
were presented.
--

From: Figo.zhang
Date: Sunday, November 14, 2010 - 8:26 pm

> Nothing to say, really.  Seems each time we're told about a bug or a
> regression, David either fixes the bug or points out why it wasn't a
> bug or why it wasn't a regression or how it was a deliberate behaviour
> change for the better.
>
> I just haven't seen any solid reason to be concerned about the state of
> the current oom-killer, sorry.
>
> I'm concerned that you're concerned!  A lot.  When someone such as
> yourself is unhappy with part of MM then I sit up and pay attention.
> But after all this time I simply don't understand the technical issues
> which you're seeing here.

We are just talking about oom-killer technical issues.

I doubt a new rewrite whose author cannot provide some evidence
and experimental results. Why did you do it? What is the prominent change
in your new algorithm?

As KOSAKI Motohiro said, "you removed CAP_SYS_RESOURCE condition with
ZERO explanation".

David just said to please use the userspace tunable oom_score_adj for
protection. But may I ask some questions:

1. What is the innovation in your new algorithm? The old one had the
same kind of user tunable, oom_adj.

2. If a server like a db-server or financial-server has a huge number of
important processes (such as root/hardware access processes) that want
protection, you leave it to the administrator to find out which processes
should be protected. You will drive the financial-server administrator
crazy!! And lose so much money!! ^~^

3. I saw your email on LKML, where you said
"I have repeatedly said that the oom killer no longer kills KDE when run
on my desktop in the presence of a memory hogging task that was written
specifically to oom the machine."
http://thread.gmane.org/gmane.linux.kernel.mm/48998

So you only tested your new oom_killer algorithm on your desktop with KDE.
Have you provided the details of how you did the test? Can anyone repeat the
experiment and get the same result as in your comment?

as KOSAKI Motohiro said, in reality word, it we makes 5-6 brain 
simulation, embedded, ...
--

From: David Rientjes
Date: Monday, November 15, 2010 - 3:14 am

The goal was to make the oom killer heuristic as predictable as possible 
and to kill the most memory-hogging task to avoid having to recall it and 
needlessly kill several tasks.

The goal behind oom_score_adj vs. oom_adj was for several reasons, as 
pointed out before:

 - give it a unit (proportion of available memory), oom_adj had no unit,

 - allow it to work on a linear scale for more control over 
   prioritization, oom_adj had an exponential scale,

 - give it a much higher resolution so it can be fine-tuned, it works with 
   a granularity of 0.1% of memory (~128M on a 128G machine), and

 - allow it to describe the oom killing priority of a task regardless of 
   its cpuset attachment, mempolicy, or memcg, or when their respective

You have full control over disabling a task from being considered with 
oom_score_adj just like you did with oom_adj.  Since oom_adj is 

Xorg tends to be killed less because of the change to the heuristic's 
baseline, which is now based on rss and swap instead of total_vm.  This is 
separate from the issues you list above, but is a benefit to the oom
killer that desktop users especially will notice.  I, personally, am 
interested more in the server market and that's why I looked for a more 
robust userspace tunable that would still be applicable when things like 
cpusets have a node added or removed.
--

From: Alan Cox
Date: Monday, November 15, 2010 - 3:57 am

Meta question - why is that a good thing. In a desktop environment it's
frequently wrong, in a server environment it is often wrong. We had this
before where people spend months fiddling with the vm and make it work
slightly differently and it suits their workload, then other workloads go

Which changeset added it to the Documentation directory as deprecated ?

Alan
--

From: David Rientjes
Date: Monday, November 15, 2010 - 1:54 pm

Most of the arbitrary heuristics were removed from oom_badness(), things 
like nice level, runtime, CAP_SYS_RESOURCE, etc., so that we only consider 
the rss and swap usage of each application in comparison to each other 
when deciding which task to kill.  We give root tasks a 3% bonus since 
they tend to be more important to the productivity or uptime of the 
machine, which did exist -- albeit with a more dramatic impact -- in the 
old heuristic.

You'll find that the new heuristic always kills the task consuming the 
most amount of rss unless influenced by userspace via the tunables (or 
within 3% of root tasks).

We always want to kill the most memory-hogging task because it avoids 
needlessly killing additional tasks when we must immediately recall the 
oom killer because we continue to allocate memory.  If that task happens 
to be of vital importance to userspace, then the user has full control 

51b1bd2a was the actual change that deprecated it, which was a direct 
follow-up to a63d83f4 which actually obsoleted it.
--

From: KOSAKI Motohiro
Date: Tuesday, November 23, 2010 - 12:16 am

It's insufficient.
a63d83f427fbce97a6cea0db2e64b0eb8435cd10 (oom: badness heuristic rewrite)
introduced a lot of incompatibility to oom_adj and oom_score.
Therefore I would suggest a full revert, then resubmitting patches which
cherry-pick the painless pieces.


--

From: Figo.zhang
Date: Tuesday, January 4, 2011 - 12:51 am

I sent a patch to protect hardware access processes from the
oom-killer before, but Rientjes did not agree with me.

But today I caught a log on my desktop: the oom-killer killed my "minicom"
and "Xorg". So I think protection for them should be added.


My desktop runs linux-2.6.36.

[figo@figo-desktop android]$ uname -a
Linux figo-desktop 2.6.36-ARCH #1 SMP PREEMPT Fri Dec 10 20:01:53 UTC 
2010 i686 Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz GenuineIntel GNU/Linux
[figo@figo-desktop android]$








--

From: KAMEZAWA Hiroyuki
Date: Tuesday, January 4, 2011 - 1:28 am

On Tue, 04 Jan 2011 15:51:44 +0800

Off topic.


... This means total_swap_pages = 0 while pages are read-in at swapoff.

Let's see 'points' for oom 
==
points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
                        totalpages;
==

Here, totalpages = total_ram + total_swap, but total_swap is 0 here.

So, points can be > 1000, easily.
(This seems not to be related to the Xorg's death itself)
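
The overflow past 1000 can be reproduced with a stand-alone version of the quoted expression. This is a sketch only: the parameters stand in for get_mm_rss(p->mm), get_mm_counter(p->mm, MM_SWAPENTS) and totalpages, the function name is hypothetical, and totalpages is assumed to be non-zero.

```c
/*
 * Model of the quoted calculation. Callers must pass totalpages > 0.
 * Small page counts are assumed so the multiplication cannot overflow.
 */
static unsigned long badness_points(unsigned long rss_pages,
				    unsigned long swap_entries,
				    unsigned long totalpages)
{
	return (rss_pages + swap_entries) * 1000 / totalpages;
}
```

With made-up numbers, a task holding 800 resident pages and 400 swap entries on a system where totalpages has dropped to 1000 (RAM only) scores 1200, above the nominal maximum of 1000.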



Thanks,
-Kame

--

From: Figo.zhang
Date: Tuesday, January 4, 2011 - 1:56 am

total_swap is 0, so
totalpages = total_ram,
get_mm_counter(p->mm, MM_SWAPENTS) = 0,

so
points = (get_mm_rss(p->mm)) * 1000 / totalpages;


--
