/proc/pid/oom_adj was introduced before cpusets or memcg, so we were
dealing with a static amount of system resources that would not change out
from underneath an application. Although that's not a defense of using a
bitshift on a heuristic that included many arbitrary choices, it was
reasonable at the time for it to be only a scalar that didn't consider the
amount of resources that an application was allowed to access.

As time moved on, cgroups such as cpusets and memcg were introduced (and
mempolicies saw far wider use as larger NUMA machines became common in the
industry) that bound or restrict the amount of memory that an aggregate of
tasks can access. Each of those mechanisms may change the amount of memory
available to an application at any time, without the knowledge of that
application, its job scheduler, or a system daemon. Therefore, it's
important to define a more powerful oom_adj mechanism that can properly
express the oom killing priority of a task relative to others that are
bound to the same resources, without causing a regression for users who
don't use any of the mechanisms that make the amount of available memory
dynamic. That's what oom_score_adj does.
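
To make the difference concrete, here is a rough sketch in Python rather
than kernel C. It is not the kernel's code: the function names are made up
for illustration, base_points stands for whatever the old heuristic
produced, and the new score is assumed to be the task's share of the memory
it is allowed to use, in units of 1/1000, plus oom_score_adj.

```python
def legacy_score(base_points, oom_adj):
    """Old style: shift the heuristic's points left or right by oom_adj bits."""
    if oom_adj == -17:              # -17 disables oom killing for the task
        return 0
    if oom_adj > 0:
        return max(base_points, 1) << oom_adj
    return base_points >> -oom_adj


def proportional_score(task_pages, allowed_pages, oom_score_adj=0):
    """New style: share of allowed memory in 1/1000 units, plus the adjustment."""
    if oom_score_adj == -1000:      # -1000 disables oom killing for the task
        return 0
    points = task_pages * 1000 // allowed_pages + oom_score_adj
    return min(max(points, 0), 1000)


if __name__ == "__main__":
    PAGE = 4096
    task = 256 * 2**20 // PAGE           # a task using 256 MiB
    whole_machine = 4 * 2**30 // PAGE    # 4 GiB of system RAM
    cpuset = 1 * 2**30 // PAGE           # the same task confined to 1 GiB

    # The same adjustment tracks whatever memory the task is bound to.
    print(proportional_score(task, whole_machine, 250))
    print(proportional_score(task, cpuset, 250))

    # The bitshift never sees the allowed memory at all.
    print(legacy_score(100, 3))
```

The point of the sketch is that an adjustment of 250 always means "a
quarter of whatever memory this task is allowed", so it stays meaningful
when a cpuset or memcg limit changes. The shifted value means something
different for every base score and ignores the limit entirely.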

With respect to memory isolation and the oom killer, we want to follow the
upstream behavior as much as possible. We are looking forward to this
interface being available in 2.6.36 and will begin to actively use it once
it is released. So although there is no existing user today, there will
be one when 2.6.36 is released. I also hope that other users of cpusets or
memcg will find it helpful to define oom killing priority with an
interface that has an actual unit rather than a bitshift, and that
understands the dynamic nature of the resource isolation that both imply.
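
For a job scheduler or daemon, using the interface could look roughly like
the sketch below. It assumes a kernel that exposes
/proc/<pid>/oom_score_adj and accepts values from -1000 to +1000. The
helper name and its path parameter are hypothetical, and the path is a
parameter only so the helper can be tried against an ordinary file.

```python
def set_oom_score_adj(value, path="/proc/self/oom_score_adj"):
    """Write an oom_score_adj value for the calling process (or to `path`)."""
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    # The kernel parses a plain decimal integer written to the file.
    with open(path, "w") as f:
        f.write(str(value))
```

Calling set_oom_score_adj(500) would then bias the calling task by half of
the memory it is allowed to use, whatever that amount happens to be at the
time. Lowering the value will generally require privilege.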

There is no generic oom notifier solution other than what is implemented
in the memory controller, and that comes at a cost of roughly 1% of system
RAM, since that's the amount of metadata that memcg requires.

oom_score_adj works for cpusets, memcg, and mempolicies.

Now you may insist that the fact that very few users actually use
/proc/pid/oom_adj for _anything_ other than -17, -15 or +15 is unrelated,
but in fact it provides a very nice context for your objections that this
change introduces all sorts of regressions and bugs that will break
everybody's system. Show me a single example of an application that tunes
its oom_adj value based on any formula whatsoever that includes its own
anticipated RAM usage or the system capacity (which is required for the
bitshift to make any sense).

A generic oom notifier that isn't dependent on the memory controller would
be very nice to have in the kernel. Unfortunately, many users cannot
afford the 1% memory penalty that the memory controller comes with.