Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()

Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
From: David Rientjes
Date: Wednesday, August 25, 2010 - 3:25 am

On Wed, 25 Aug 2010, KOSAKI Motohiro wrote:


That's wrong, we don't even use this heuristic yet and there is nothing, 
in any way, that is specific to Google.


You continually bring this up, and I've answered it three times, but 
you've never responded to it before and completely ignore it.  I really 
hope and expect that you'll participate more in the development process 
and not continue to reinterate your talking points when you have no answer 
to my response.

You're wrong, especially with regard to cpusets, which was formally part 
of the heuristic itself.

Users bind an aggregate of tasks to a cgroup (cpusets or memcg) as a means 
of isolation and attach a set of resources (memory, in this case) for 
those tasks to use.  The user who does this is fully aware of the set of 
tasks being bound, there is no mystery or unexpected results when doing 
so.  So when you set an oom_score_adj for a task, you don't necessarily 
need to be aware of the set of resources it has available, which is 
dynamic and an attribute of the system or cgroup, but rather the priority 
of that task in competition with other tasks for the same resources.

_That_ is what is important in having a userspace influence on a badness 
heursitic: how those badness scores compare relative to other tasks that 
share the same resources.  That's how a task is chosen for oom kill, not 
because of a static formula such as you're introducing here that outputs a 
value (and, thus, a priority) regardless of the context in which the task 
is bound.

That also means that the same task is not necessarily killed in a 
cpuset-constrained oom compared to a system-wide oom.  If you bias a task 
by 30% of available memory, which Kame did in his example above, it's 
entirely plausible that task A should be killed because it's actual usage 
is only 1/20th of the machine.  When its cpuset is oom, and the admin has 
specifically bound that task to only 2G of memory, we'd natually want to 
kill the memory hogger, that is using 50% of the total memory available to 
it.


So you'd rather use the range of oom_adj from -17 to +15 instead of 
oom_score_adj from -1000 to +1000 where each point is 68GB?  I don't 
understand your point here as to why oom_score_adj is worse.

But, yes, in reality we don't really care about the granularity so much 
that we need to prioritize a task using 512MB more memory than another to 
break the tie on a 1TB machine, 1/2048th of its memory.


Everybody knows this is useless beyond polarizing a task for kill or 
making it immune.


This, on the other hand, has an actual unit (proportion of available 
memory) that can be used to prioritize tasks amongst those competing for 
the same set of shared resources and remains constant even when a task 
changes cpuset, its memcg limit changes, etc.

And your equation is wrong, it's

	((rss + swap) / (available ram + swap)) + oom_score_adj

which is completely different from what you think it is.


Yup, it definitely can, which is why as I mentioned to Kame (who doesn't 
have strong feelings either way, even though you quote him as having these 
strong objections) that you can easily convert oom_score_adj into a 
stand-alone memory quantity (biasing or forgiving 512MB of a task's 
memory, for example) in the context it is currently attached to with 
simple arithemetic in userspace.  That's why oom_score_adj is powerful.


I don't know what this means, and this was your criticism before I changed 
the denominator during the revision of the patchset, so it's probably 
obsoleted.  oom_score_adj always operates based on the proportion of 
memory available to the application which is how the new oom killer 
determines which tasks to kill: relative to the importance (if defined by 
userspace) and memory usage compared to other tasks competing for it.


Nothing specific about any of this to Google.  Users who actually setup 
their machines to use mempolicies, cpusets, or memcgs actually do want a 
powerful interface from userspace to tune the priorities in terms of both 
business goals and also importance of the task.  That is done much more 
powerfully now with oom_score_adj than the previous implementation.  Users 
who don't use these cgroups, especially desktop users, can see 
oom_score_adj in terms of a memory quantity that remains static: they 
aren't going to encounter changing memcg limits, cpuset mems, etc.

That said, I really don't know why you keep mentioning "Google this" and 
"Google that" when the company I'm working for is really irrelevant to 
this discussion.

With that, I respectfully nack your patch.
--
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalizatio ..., David Rientjes, (Wed Aug 25, 3:25 am)
Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalizatio ..., KAMEZAWA Hiroyuki, (Wed Aug 25, 5:39 pm)
Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalizatio ..., KAMEZAWA Hiroyuki, (Wed Aug 25, 6:03 pm)
Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalizatio ..., KAMEZAWA Hiroyuki, (Wed Aug 25, 6:11 pm)
Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalizatio ..., KAMEZAWA Hiroyuki, (Wed Aug 25, 8:20 pm)