Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()

Previous thread: [RFC Patch 1/1] input:synaptics rmi4 touchpad driver support by Naveen Kumar GADDIPATI on Wednesday, August 25, 2010 - 2:37 am. (1 message)

Next thread: [PATCH] x86: EuroBraille/Iris power off by Shérab on Wednesday, August 25, 2010 - 2:49 am. (21 messages)
From: KOSAKI Motohiro
Date: Wednesday, August 25, 2010 - 2:42 am

The current oom_score_adj is completely broken because it is strongly bound
to Google's use case and ignores all others.

1) Priority inversion
   As Kamezawa-san pointed out, this breaks cgroup and lxc environments.
   He said,
	> Assume 2 processes A, B which have oom_score_adj of 300 and 0,
	> and A uses 200M, B uses 1G of memory under a 4G system.
	>
	> Under the system:
	> 	A's score = (200M * 1000)/4G + 300 = 350
	> 	B's score = (1G * 1000)/4G = 250
	>
	> In the cpuset, it has 2G of memory:
	> 	A's score = (200M * 1000)/2G + 300 = 400
	> 	B's score = (1G * 1000)/2G = 500
	>
	> This priority inversion doesn't happen in the current system.

2) Ratio-based points don't work on large machines
   oom_score_adj normalizes the oom score to the 0-1000 range,
   but if the machine has 1TB of memory, 1 point (i.e. 0.1%) means
   1GB. This is not suitable for a tuning parameter.
   As I said, a proportional-value-oriented tuning parameter has
   a scalability risk.

3) No reason to implement ABI breakage.
   The old tuning parameter means
	oom-score = oom-base-score x 2^oom_adj
   The new tuning parameter means
	oom-score = oom-base-score + oom_score_adj / (totalram + totalswap)
   but "oom_score_adj / (totalram + totalswap)" can be calculated in
   userland too, because both totalram and totalswap are exported by
   /proc. So there is no reason to introduce a funny new equation.

4) totalram-based normalization assumes a flat memory model.
   For example, on an asymmetric NUMA machine, fat-node memory and
   thin-node memory might have different weight values.
   In other words, totalram-based priority is just one policy. A fixed,
   workload-dependent policy shouldn't be embedded in the kernel, probably.
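
For illustration only, the arithmetic in points 1) and 2) above can be
sketched in Python. This is not kernel code: the helper names (score, rank,
bytes_per_point) are made up for this sketch, and decimal units (M = 10^6,
G = 10^9) are assumed so that the results match the round numbers in the
quoted example.

```python
# Hypothetical helpers illustrating the proportional score; not kernel code.
M = 10**6   # decimal units, assumed so the numbers match the quoted example
G = 10**9
T = 10**12


def score(usage_bytes, available_bytes, oom_score_adj=0):
    """Usage as parts-per-thousand of available memory, plus the adjustment."""
    return usage_bytes * 1000 // available_bytes + oom_score_adj


def rank(tasks, available_bytes):
    """Return task names ordered from highest score (killed first) to lowest."""
    scored = {name: score(usage, available_bytes, adj)
              for name, (usage, adj) in tasks.items()}
    return sorted(scored, key=scored.get, reverse=True)


def bytes_per_point(total_bytes):
    """Memory represented by one point (0.1%) of the 0-1000 range."""
    return total_bytes // 1000


# name -> (memory usage, oom_score_adj), as in the quoted example
tasks = {"A": (200 * M, 300), "B": (1 * G, 0)}

system_order = rank(tasks, 4 * G)   # system-wide oom on a 4G machine
cpuset_order = rank(tasks, 2 * G)   # oom inside a 2G cpuset
```

With these inputs the ordering flips between the two contexts (A first
system-wide, B first inside the cpuset), and bytes_per_point(T) comes out
as 1G, which is the granularity concern raised in point 2).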

So this patch removes the *UGLY* total_pages suck completely. Googlers
can calculate it in userland!

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/base.c        |   33 ++---------
 include/linux/oom.h   |   16 +-----
 include/linux/sched.h |    2 +-
 mm/oom_kill.c         |  142 ...
From: KOSAKI Motohiro
Date: Wednesday, August 25, 2010 - 2:42 am

oom_adj is not only used as a kernel knob, but also as an
application interface.
So adding a new knob is no good reason to deprecate it.

Also, after the former patch, oom_score_adj can't be used for setting
OOM_DISABLE. We need the "echo -17 > /proc/<pid>/oom_adj" thing.

This reverts commit 51b1bd2ace1595b72956224deda349efa880b693.
---
 Documentation/feature-removal-schedule.txt |   25 -------------------------
 Documentation/filesystems/proc.txt         |    3 ---
 fs/proc/base.c                             |    8 --------
 include/linux/oom.h                        |    3 ---
 4 files changed, 0 insertions(+), 39 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 842aa9d..aff4d11 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -151,31 +151,6 @@ Who:	Eric Biederman <ebiederm@xmission.com>
 
 ---------------------------
 
-What:	/proc/<pid>/oom_adj
-When:	August 2012
-Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
-	badness heuristic used to determine which task to kill when the kernel
-	is out of memory.
-
-	The badness heuristic has since been rewritten since the introduction of
-	this tunable such that its meaning is deprecated.  The value was
-	implemented as a bitshift on a score generated by the badness()
-	function that did not have any precise units of measure.  With the
-	rewrite, the score is given as a proportion of available memory to the
-	task allocating pages, so using a bitshift which grows the score
-	exponentially is, thus, impossible to tune with fine granularity.
-
-	A much more powerful interface, /proc/<pid>/oom_score_adj, was
-	introduced with the oom killer rewrite that allows users to increase or
-	decrease the badness() score linearly.  This interface will replace
-	/proc/<pid>/oom_adj.
-
-	A warning will be emitted to the kernel log if an application uses this
-	deprecated interface.  After ...
From: David Rientjes
Date: Wednesday, August 25, 2010 - 3:27 am

Since I nacked the parent patch of this, I implicitly nack this one as 
well since oom_score_adj shouldn't be going anywhere.  The way to disable 
oom killing for a task via the new interface, /proc/pid/oom_score_adj, is 
by OOM_SCORE_ADJ_MIN as specified in the documentation.
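
As a rough sketch of what that looks like from userspace: disabling oom
killing through the new interface amounts to writing the minimum value into
/proc/<pid>/oom_score_adj. The value -1000 for OOM_SCORE_ADJ_MIN is assumed
here from the kernel documentation rather than stated in this thread, and
the helper names are hypothetical.

```python
# Assumed from the kernel documentation, not from this thread.
OOM_SCORE_ADJ_MIN = -1000


def oom_score_adj_path(pid):
    """Path of the per-task tunable discussed in this thread."""
    return f"/proc/{pid}/oom_score_adj"


def disable_oom_kill(pid):
    """Write the minimum adjustment for the given pid.

    Lowering the value usually needs sufficient privilege, so this may
    raise PermissionError when run as an ordinary user.
    """
    with open(oom_score_adj_path(pid), "w") as f:
        f.write(str(OOM_SCORE_ADJ_MIN))
```

Only the path helper is safe to call anywhere; disable_oom_kill() changes
the state of a real task and is shown only to make the mechanism concrete.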
--

From: David Rientjes
Date: Wednesday, August 25, 2010 - 3:25 am

That's wrong, we don't even use this heuristic yet and there is nothing, 

You continually bring this up, and I've answered it three times, but 
you've never responded to it before and completely ignore it.  I really 
hope and expect that you'll participate more in the development process 
and not continue to reiterate your talking points when you have no answer 
to my response.

You're wrong, especially with regard to cpusets, which was formally part 
of the heuristic itself.

Users bind an aggregate of tasks to a cgroup (cpusets or memcg) as a means 
of isolation and attach a set of resources (memory, in this case) for 
those tasks to use.  The user who does this is fully aware of the set of 
tasks being bound, there is no mystery or unexpected results when doing 
so.  So when you set an oom_score_adj for a task, you don't necessarily 
need to be aware of the set of resources it has available, which is 
dynamic and an attribute of the system or cgroup, but rather the priority 
of that task in competition with other tasks for the same resources.

_That_ is what is important in having a userspace influence on a badness 
heuristic: how those badness scores compare relative to other tasks that 
share the same resources.  That's how a task is chosen for oom kill, not 
because of a static formula such as you're introducing here that outputs a 
value (and, thus, a priority) regardless of the context in which the task 
is bound.

That also means that the same task is not necessarily killed in a 
cpuset-constrained oom compared to a system-wide oom.  If you bias a task 
by 30% of available memory, which Kame did in his example above, it's 
entirely plausible that task A should be killed because its actual usage 
is only 1/20th of the machine.  When its cpuset is oom, and the admin has 
specifically bound that task to only 2G of memory, we'd naturally want to 
kill the memory hogger, that is using 50% of the total memory available to 

So you'd rather use the range of oom_adj ...
From: KAMEZAWA Hiroyuki
Date: Wednesday, August 25, 2010 - 5:39 pm

On Wed, 25 Aug 2010 03:25:25 -0700 (PDT)

I'm now trying to write a userspace tool to calculate this, for myself.
Then, could you update the documentation?
==
3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------

This file can be used to check the current score used by the oom-killer is for
any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
process should be killed in an out-of-memory situation.
==

Add some documentation like:
==
(For system monitoring tool developers, not for ordinary users.)
The oom_score calculation is implementation dependent and can be modified
without any notice. But the current logic is

oom_score = ((proc's rss + proc's swap) * 1000 / (available ram + swap)) + oom_score_adj

proc's rss and swap can be obtained from /proc/<pid>/statm, and available ram + swap
depends on the situation.
If the whole system is under oom,
	available ram  == /proc/meminfo's MemTotal
	available swap == in most cases == /proc/meminfo's SwapTotal
When you use memory cgroup,
	When swap is limited,   available ram + swap == memory cgroup's memsw limit.
	When swap is unlimited, available ram + swap == memory cgroup's memory limit +
							SwapTotal

So please be careful: the order of oom_score among tasks depends on the
situation. Assume 2 processes A, B which have oom_score_adj of 300 and 0,
and A uses 200M, B uses 1G of memory under a 4G system.

 Under the 4G system:
 	A's score = (200M * 1000)/4G + 300 = 350
 	B's score = (1G * 1000)/4G = 250

 In the memory cgroup, which has 2G of resource:
 	A's score = (200M * 1000)/2G + 300 = 400
 	B's score = (1G * 1000)/2G = 500

You shouldn't depend on /proc/<pid>/oom_score if you have to handle OOM under
cgroups and cpuset. But the logic is simple.
==
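
A userspace tool along these lines would need "available ram + swap" for the
system-wide case. The following is a minimal sketch, assuming the usual
"Name:   value kB" layout of /proc/meminfo; the function names are
hypothetical and the memory cgroup cases are deliberately left out.

```python
def meminfo_kb(meminfo_text, key):
    """Return the value (in kB) of one field from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name == key:
            return int(rest.split()[0])
    raise KeyError(key)


def system_available_bytes(meminfo_text):
    """MemTotal + SwapTotal in bytes: the system-wide oom case described above."""
    total_kb = (meminfo_kb(meminfo_text, "MemTotal")
                + meminfo_kb(meminfo_text, "SwapTotal"))
    return total_kb * 1024


def read_system_available_bytes(path="/proc/meminfo"):
    """Convenience wrapper that reads the live file on a Linux system."""
    with open(path) as f:
        return system_available_bytes(f.read())
```

Keeping the parser separate from the file read makes it possible to check
the logic against saved meminfo text on a machine other than the target.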

If you don't want that, I'll add text and a sample tool to cgroup/memory.txt.


Thanks,
-Kame

--

From: David Rientjes
Date: Wednesday, August 25, 2010 - 5:52 pm

You'll want to look at section 3.1 of Documentation/filesystems/proc.txt, 

I'd hesitate to state the formula outside of the implementation and 
instead focus on the semantics of oom_score_adj (as a proportion of 
available memory compared to other tasks), which I tried doing in section 
3.1.  Then, the userspace tool only need be concerned about the units of 
oom_score_adj rather than whether rss, swap, or later extensions such as 
shm are added.

Thanks for working on this, Kame!
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, August 25, 2010 - 6:03 pm

On Wed, 25 Aug 2010 17:52:06 -0700 (PDT)
BTW, why don't you subtract the amount of hugepages?

The old code used
	"totalrampages - hugepage" as available memory.

IIUC, the number of hugepages is not accounted in mm->rss, so isn't it
better to subtract the number of hugepages?
Hmm...does it make no difference?

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Wednesday, August 25, 2010 - 6:11 pm

On Wed, 25 Aug 2010 17:52:06 -0700 (PDT)
Hmm. I'll add text like the following to cgroup/memory.txt. O.K.?

==
Notes on oom_score and oom_score_adj.

oom_score is calculated as
	oom_score = (task's proportion of memory) + oom_score_adj.

So, when you use oom_score_adj to control the order of priority of oom,
you should know the amount of memory you can use.
An approximate oom_score under memcg can then be

 memcg_oom_score = (oom_score - oom_score_adj) * system_memory/memcg's limit
		+ oom_score_adj.

And yes, this can be affected by hierarchy control of memcg, and the calculation
will be more complicated. See also the oom_disable feature.
==
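
The conversion above can be written as a small helper. This is only a
sketch of the approximate formula in the text, with a hypothetical function
name; it ignores memcg hierarchy, as the note itself warns, and assumes
decimal units (G = 10^9) to match the earlier example numbers.

```python
G = 10**9  # decimal units, assumed to match the example numbers in this thread


def memcg_oom_score(oom_score, oom_score_adj, system_memory, memcg_limit):
    """Rescale the memory part of a system-wide oom_score to a memcg limit."""
    memory_part = oom_score - oom_score_adj
    return memory_part * system_memory // memcg_limit + oom_score_adj


# Task A from the earlier example: system-wide score 350 with adj 300.
a_in_memcg = memcg_oom_score(350, 300, 4 * G, 2 * G)
# Task B from the earlier example: system-wide score 250 with adj 0.
b_in_memcg = memcg_oom_score(250, 0, 4 * G, 2 * G)
```

The results (400 for A, 500 for B) agree with the scores computed directly
against the 2G limit in the earlier example.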

Thanks,
-Kame

--

From: David Rientjes
Date: Wednesday, August 25, 2010 - 7:50 pm

I'd replace "memory" with "memory limit (or memsw limit)" so it's clear 

Hmm, you need to know the amount of memory that you can use iff you know 
the memcg limit and it's a static value.  Otherwise, you only need to know 
the "memory usage of your application relative to others in the same 
cgroup."  An oom_score_adj of +300 adds 30% of that memcg's limit to the 
task, allowing all other tasks to use 30% more memory than that task with 
it still being killed.  An oom_score_adj of -300 allows that task to use 30% 
more memory than other tasks without getting killed.  These don't need to 

Right, that's the exact score within the memcg.

But, I still wouldn't encourage a formula like this because the memcg 
limit (or cpuset mems, mempolicy nodes, etc) are dynamic and may change 
out from under us.  So it's more important to define oom_score_adj in the 
user's mind as a proportion of memory available to be added (either 
positively or negatively) to its memory use when comparing it to other 
tasks.  The point is that the memcg limit isn't interesting in this 
formula, it's more important to understand the priority of the task 
_compared_ to other tasks memory usage in that memcg.

It probably would be helpful, though, to know that if a vital system task 
uses 1G, for instance, in a 4G memcg, an oom_score_adj of -250 will 
disable oom killing for it.  If that task leaks memory or becomes 
significantly large, for whatever reason, it could be killed, but we _can_ 
discount the 1G in comparison to other tasks as the "cost of doing 
business" when it comes to vital system tasks:

	(memory usage) * (memory+swap limit / system memory)
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, August 25, 2010 - 8:20 pm

On Wed, 25 Aug 2010 19:50:22 -0700 (PDT)


yes. For defining/understanding priority, oom_score_adj is that.


yes. Under an 8G system, -250 will allow ignoring 2G of usage.

== How about this text ? ==

When you set a task's oom_score_adj, it can get priority not to be oom-killed.
oom_score_adj gives priority proportional to the memory limit.

Assume you set oom_score_adj to -250.

Under a 4G memory limit, it gets a 25% bonus...a 1G memory bonus for avoiding OOM.
Under an 8G memory limit, it gets a 25% bonus...a 2G memory bonus for avoiding OOM.

So the bonus a task can get depends on the context of the OOM. If you use
oom_score_adj and want to give a bonus to a task, setting it with regard to the
minimum memory limit which the task is under will work well.
==
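
The "bonus" in the text above is simply the adjustment scaled by the memory
limit. A minimal sketch, with a hypothetical helper name and binary units
(GiB = 2^30) assumed so the division is exact:

```python
GiB = 2**30  # binary units, assumed so that the division below is exact


def bonus_bytes(oom_score_adj, limit_bytes):
    """Memory a task may use "for free" under the given limit.

    A negative oom_score_adj yields a positive bonus; a positive
    adjustment yields a negative one (a penalty).
    """
    return -oom_score_adj * limit_bytes // 1000


bonus_4g = bonus_bytes(-250, 4 * GiB)
bonus_8g = bonus_bytes(-250, 8 * GiB)
```

The same -250 is worth 1G under a 4G limit and 2G under an 8G limit, which
is why the text suggests choosing the value against the smallest limit the
task can end up under.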

Thanks,
-Kame

--

From: David Rientjes
Date: Wednesday, August 25, 2010 - 8:52 pm

I understand it's a little tricky when dealing with memcg-constrained oom 
conditions versus system-wide oom conditions.  Think of it this way: if 
the system is oom, then every memcg, every cpuset, and every mempolicy is 
also oom.  That doesn't imply that something in every memcg, every cpuset, 
or every mempolicy must be killed, however.  What cgroup happens to be 
penalized in this scenario isn't necessarily the scope of oom_score_adj's 
purpose.  oom_score_adj certainly does have a stronger influence over a 
task's priority when it's a system oom and not a memcg oom because the 
size of available memory is different, but that's fine: we set positive 
and negative oom_score_adj values for a reason based on the application, 
and that's not necessarily (but can be) a function of the memcg or system 
capacity.  Again, oom_score_adj is only meaningful when considered 
relative to other candidate tasks since the badness score itself is 
considered relative to other candidate tasks.

You can have multiple tasks that have +1000 oom_score_adj values (or 
multiple tasks that have +15 oom_adj values).  Only one will be killed and 
it's dependent only on the ordering of the tasklist.  That isn't an 

If the memcg limit changes because we're attaching more tasks, yes, we may 
want to change its oom_score_adj relative to those tasks.  So 
oom_score_adj is a function of the attached tasks and its allowed set of 

Yeah, that conversion could be useful if the system RAM capacity or memcg 


Very nice, and the "bonus" there is what the task can safely use in 
comparison to any other task competing for the same resources without 
getting selected itself because of that memory.
--

From: KOSAKI Motohiro
Date: Sunday, August 29, 2010 - 7:58 pm

Please show us some evidence. A big mouth is no good way to persuade us.

Yes, I ignored it. Don't talk about your dream. I hope to see a concrete use case.
As I have repeatedly said, I don't care about you while you ignore real-world end users.
ANY BODY DON'T EXCEPT STABILIZATION DEVELOPERS ARE KINDFUL FOR END USER
HARMFUL. WE HAVE NO MERCY WHILE YOU CONTINUE TO IMMORAL DEVELOPMENT.

I'm waiting one more day. Pray! Anyone may join this discussion and
explain real use instead of you. We don't ignore end users. But nobody
except you responded to this even though I don't care your , I definitely 


I agree your implementation works fine if admins have the same policy

No. As I said,
 - If you want to solve a minority issue, you have to keep no regression
   for the majority of users.
 - If you want to solve a major issue and make a big change, investigate
   worldwide use cases carefully and reflect them.

It was pointed out that oom_score_adj overlooks a lot of use cases. then I

your equation can be changed 

	(rss + swap)  + oom_score_adj x (available ram + swap)
	-----------------------------------------------------------
		(available ram + swap)



I already explained the asymmetric NUMA issue in the past. Again, don't assume


Sorry, I don't care this. Please fix you.

Thanks.



--

From: David Rientjes
Date: Wednesday, September 1, 2010 - 3:06 pm

We are certainly looking forward to using this when 2.6.36 is released 

I'm not ignoring any user with this change, oom_score_adj is an extremely 
powerful interface for users who want to use it.  I'm sorry that it's not 
as simple to use as you may like.

Basically, it comes down to this: few users actually tune their oom 
killing priority, period.  That's partly because they accept the oom 
killer's heuristics to kill a memory-hogging task or use panic_on_oom, or 
because the old interface, /proc/pid/oom_adj, had no unit and no logical 
way of using it other than polarizing it (either +15 or -17).

For those users who do change their oom killing priority, few are using 
cpusets or memcg.  Yes, the priority changes depending on the context of 
the oom, but for users who don't use these cgroups the oom_score_adj unit 
is static since the amount of system memory (the only oom constraint) is 
static.

Now, for the users of both oom_score_adj and cpusets or memcg (in the 
future this will include Google), these users are interested in oom 
killing priority relative to other tasks attached to the same set of 
resources.  For our particular use case, we attach an aggregate of tasks 
to a cgroup and have a preference on the order in which those tasks are 
killed whenever that cgroup limit is exhausted.  We also care about 
protecting vital system tasks so that they aren't targeted before others 
are killed, such as job schedulers.

I think the key point you're missing in our use case is that we don't 
necessarily care about the system-wide oom condition when we're running with 
cpusets or memcg.  We can protect tasks with negative oom_score_adj, but 
we don't care about close tiebreakers on which cpuset or memcg is penalized 
when the entire system is out of memory.  If that's the case, each cpuset 
and memcg is also, by definition, out of memory, so they are all subject 
to the oom killer.  This is equivalent to having several tasks with an 
oom_score_adj of +1000 (or oom_adj of ...
From: KOSAKI Motohiro
Date: Tuesday, September 7, 2010 - 7:44 pm

Of course, there is simply zero justification. Who asked about Google's usage?
Every developer has their own debugging and machine administration patches.
Aren't you concerned about why we don't push them upstream? only one usage is

Could you please be serious? We are not making a sand castle, we are making
a kernel. You have to understand the difference between them. Zero user feature

Unrelated. We already have the oom notifier. We have no reason to add a new knob

The memcg limit has already been exposed via /cgroup/memory.limit_in_bytes.
It's clearly userland's role.

The fact is, my patch is more powerful than yours because your patch
has a fixed oom management policy, but mine doesn't. It can be customized
to suit the customer.

More importantly, we already have the oom notifier, and it is the most powerful
infrastructure. Any oom policy can be constructed freely with it. It's not
restricted to the kernel implementation.


The fact is, your new interface doesn't match HPC, server, banking systems
and embedded, AFAIK. At least I couldn't find such a use case in my job
experience in those areas. Also, I'm now a journalist, therefore I have some
connection with Linux user groups, but I didn't get positive feedback for
yours; instead I got some negative feedback. Of course, I don't know all of
the world, therefore I did ask you for real-world use cases repeatedly. But

Unrelated. You are still talking about your policy. Why do we need to care about it?

Well, this is clearly a bug. oom_adj's behavior was changed, and it was deprecated
by mistake; therefore the latest kernel outputs pointless warnings at each boot.

That said, be serious! Otherwise GO AWAY.


--

From: David Rientjes
Date: Tuesday, September 7, 2010 - 8:12 pm

/proc/pid/oom_adj was introduced before cpusets or memcg, so we were 
dealing with a static amount of system resources that would not change out 
from underneath an application.  Although that's not a defense of using a 
bitshift on a heuristic that included many arbitrary selections, it was 
reasonable to make it only a scalar that didn't consider the amount of 
resources that an application was allowed to access.

As time moved on, cgroups such as cpusets and memcg were introduced (and 
mempolicies became much more popular as larger NUMA machines became more 
popular in the industry) that bound or restricted the amount of memory 
that an aggregate of tasks could access.  Each of those methods may change 
the amount of memory resources that an application has available to it at 
any time without knowledge of that application, job scheduler, or system 
daemon.  Therefore, it's important to define a more powerful oom_adj 
mechanism that can properly attribute the oom killing priority of a task 
in comparison to others that are bound to the same resources without 
causing a regression on users who don't use those mechanisms that allow 

With respect to memory isolation and the oom killer, we want to follow the 
upstream behavior as much as possible.  We are looking forward to this 
interface being available in 2.6.36 and will begin to actively use it once 
it is released.  So although there is no existing user today, there will 
be when 2.6.36 is released.  I also hope that other users of cpusets or 
memcg will find it helpful to define oom killing priority with a unit 
rather than a bitshift and that understands the dynamic nature of resource 

There is no generic oom notifier solution other than what is implemented 
in the memory controller, and that comes at a cost of roughly 1% of system 
RAM since that's the amount of metadata that the memcg requires.  
oom_score_adj works for cpusets, memcg, and mempolicies.

Now you may insist that the fact that very few users actually ...