Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()

From: David Rientjes
Date: Tuesday, September 7, 2010 - 8:12 pm

On Wed, 8 Sep 2010, KOSAKI Motohiro wrote:


/proc/pid/oom_adj was introduced before cpusets or memcg, so we were 
dealing with a static amount of system resources that would not change out 
from underneath an application.  Although that's not a defense of using a 
bitshift on a heuristic that included many arbitrary selections, it was 
reasonable to make it only a scalar that didn't consider the amount of 
resources that an application was allowed to access.

As time moved on, cgroups such as cpusets and memcg were introduced (and 
mempolicies became much more widely used as larger NUMA machines became 
more common in the industry) that bound or restrict the amount of memory 
that an aggregate of tasks can access.  Each of those methods may change 
the amount of memory resources that an application has available to it at 
any time without the knowledge of that application, job scheduler, or 
system daemon.  Therefore, it's important to define a more powerful oom_adj 
mechanism that can properly attribute the oom killing priority of a task 
in comparison to others that are bound to the same resources without 
causing a regression on users who don't use those mechanisms that allow 
that working set size to be dynamic.  That's what oom_score_adj does.
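
To make the contrast concrete, here is a minimal sketch in C.  It is not 
the kernel's oom_badness() code: the helper names adj_bitshift() and 
score_proportional() are hypothetical, and the sketch assumes that oom_adj 
shifts a raw heuristic score while oom_score_adj is added to a score 
expressed in thousandths of the memory the task is allowed to access.

```c
/*
 * Illustrative sketch only; these helpers are hypothetical and are not
 * the kernel's oom_badness() implementation.
 */

/* oom_adj style: scale a raw, unitless heuristic score by a bitshift. */
static unsigned long adj_bitshift(unsigned long points, int oom_adj)
{
	if (oom_adj > 0)
		return (points ? points : 1) << oom_adj;
	if (oom_adj < 0)
		return points >> -oom_adj;
	return points;
}

/*
 * oom_score_adj style (assumed): express usage in thousandths of the
 * memory the task may access, then add the adjustment and clamp.
 */
static long score_proportional(unsigned long used_pages,
			       unsigned long allowed_pages,
			       int oom_score_adj)
{
	long points;

	if (!allowed_pages)
		return 0;	/* nothing to compare against in this sketch */
	points = (long)(used_pages * 1000 / allowed_pages) + oom_score_adj;
	if (points < 0)
		return 0;
	return points < 1000 ? points : 1000;
}
```

Under these assumptions, a task using 250 of 1000 allowed pages with an 
oom_score_adj of 250 scores 500.  If its cpuset or memcg shrinks to 500 
pages, the same task scores 750 without anybody rewriting the tunable, 
whereas adj_bitshift(1000, 3) yields 8000 no matter what the task is 
allowed to access.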


With respect to memory isolation and the oom killer, we want to follow the 
upstream behavior as much as possible.  We are looking forward to this 
interface being available in 2.6.36 and will begin to actively use it once 
it is released.  So although there is no existing user today, there will 
be when 2.6.36 is released.  I also hope that other users of cpusets or 
memcg will find it helpful to define oom killing priority with an 
interface that has a unit rather than a bitshift and that understands the 
dynamic nature of resource isolation that they both imply.
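
For readers who want to try the interface, a userspace task could set its 
own value along these lines.  This is a sketch under stated assumptions: 
the helper set_oom_score_adj() and the SKETCH_ADJ_* names are hypothetical, 
the file is assumed to be /proc/self/oom_score_adj, and the accepted range 
is assumed to be -1000 to 1000.

```c
#include <stdio.h>

/* Assumed valid range for oom_score_adj; the macro names are hypothetical. */
#define SKETCH_ADJ_MIN (-1000)
#define SKETCH_ADJ_MAX 1000

/*
 * Hypothetical helper: write `value` to the oom_score_adj file at `path`
 * (for the calling task that would be "/proc/self/oom_score_adj").
 * Returns 0 on success, -1 on a range or I/O error.
 */
static int set_oom_score_adj(const char *path, int value)
{
	FILE *f;
	int ok;

	if (value < SKETCH_ADJ_MIN || value > SKETCH_ADJ_MAX)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	/* fclose() flushes, so a rejected write shows up here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A job that prefers to be killed first among the tasks sharing its cpuset 
or memcg might call set_oom_score_adj("/proc/self/oom_score_adj", 500), 
which biases it by half of whatever memory it is currently allowed to 
access.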


There is no generic oom notifier solution other than what is implemented 
in the memory controller, and that comes at a cost of roughly 1% of system 
RAM since that's the amount of metadata that the memcg requires.  
oom_score_adj works for cpusets, memcg, and mempolicies.

Now you may insist that the fact that very few users actually use 
/proc/pid/oom_adj for _anything_ other than -17, -15 or +15 is unrelated, 
but in fact it provides a very nice context for your objections that 
oom_score_adj introduces all sorts of regressions and bugs that will break 
everybody's system.  Show me a single example of an application that tunes 
its oom_adj 
value based on any formula whatsoever that includes its own anticipated 
RAM usage or system capacity (which is required to make the bitshift make 
any sense).


A generic oom notifier that isn't dependent on the memory controller would 
be very nice to have in the kernel.  Unfortunately, many users cannot 
incur the 1% memory penalty that the memory controller comes with.