I understand it's a little tricky when dealing with memcg-constrained oom
conditions versus system-wide oom conditions. Think of it this way: if
the system is oom, then every memcg, every cpuset, and every mempolicy is
also oom. That doesn't imply that something in every memcg, every cpuset,
or every mempolicy must be killed, however. Which cgroup happens to be
penalized in this scenario isn't necessarily within the scope of
oom_score_adj's purpose. oom_score_adj certainly does have a stronger
influence over a
task's priority when it's a system oom and not a memcg oom because the
size of available memory is different, but that's fine: we set positive
and negative oom_score_adj values for a reason based on the application,
and that's not necessarily (but can be) a function of the memcg or system
capacity. Again, oom_score_adj is only meaningful when considered
relative to other candidate tasks since the badness score itself is
considered relative to other candidate tasks.
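To put rough numbers on the difference in influence, here is a minimal Python sketch. It assumes a simplified proportional heuristic: a task's memory use is scaled to 0..1000 of the memory available in the oom context, and oom_score_adj is added to that. The function names, the page counts, and the clamping are illustrative assumptions, not the kernel's code.

```python
# Simplified model, not kernel code: a score is a proportion (0..1000) of
# the memory available in the oom context, plus oom_score_adj.

def badness(rss_pages, oom_score_adj, total_pages):
    """Hypothetical badness score, clamped to 0..1000."""
    points = rss_pages * 1000 // total_pages + oom_score_adj
    return max(0, min(points, 1000))


def adj_in_pages(oom_score_adj, total_pages):
    """Memory that a given oom_score_adj is worth in this oom context."""
    return oom_score_adj * total_pages // 1000


SYSTEM_PAGES = 4_000_000  # assumed system-wide capacity, in pages
MEMCG_PAGES = 250_000     # assumed memcg limit, in pages

# The same +100 is worth far more memory in a system oom than a memcg oom.
print(adj_in_pages(100, SYSTEM_PAGES))  # 400000
print(adj_in_pages(100, MEMCG_PAGES))   # 25000

# Same task, same oom_score_adj, different oom context.
print(badness(50_000, 100, SYSTEM_PAGES))  # 112
print(badness(50_000, 100, MEMCG_PAGES))   # 300
```

In this model the value only shifts a task's score against the other candidates in the same context; it says nothing on its own about whether the task is selected.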

You can have multiple tasks that have +1000 oom_score_adj values (or
multiple tasks that have +15 oom_adj values). Only one will be killed, and
which one depends only on the ordering of the tasklist. That isn't an
exceptional case; it only means that we prevented needless oom killing.
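The tie case can be sketched the same way. This is a minimal Python model, not the kernel's selection loop: the tuple layout, the names, and the strict greater-than comparison (which makes the earliest of several equal scores win) are all assumptions made for illustration.

```python
# Simplified model, not kernel code. Tasks are (name, rss_pages, oom_score_adj).

def badness(rss_pages, oom_score_adj, total_pages):
    """Hypothetical badness score, clamped to 0..1000."""
    points = rss_pages * 1000 // total_pages + oom_score_adj
    return max(0, min(points, 1000))


def select_victim(tasklist, total_pages):
    """Return the name of the single task to kill, or None."""
    chosen, chosen_points = None, 0
    for name, rss_pages, oom_score_adj in tasklist:
        points = badness(rss_pages, oom_score_adj, total_pages)
        # Strict comparison (an assumption): a later task with an equal
        # score never displaces an earlier one.
        if points > chosen_points:
            chosen, chosen_points = name, points
    return chosen


tasks = [("a", 10_000, 1000), ("b", 90_000, 1000), ("c", 50_000, 0)]
print(select_victim(tasks, 1_000_000))        # a
print(select_victim(tasks[::-1], 1_000_000))  # b
```

Both "a" and "b" clamp to the maximum score, so exactly one of them is chosen and the list order decides which.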

If the memcg limit changes because we're attaching more tasks, yes, we may
want to change a task's oom_score_adj relative to those tasks. So
oom_score_adj is a function of the attached tasks and the task's allowed
set of resources in comparison to them, not the limit itself.

Yeah, that conversion could be useful if the system RAM capacity or memcg
limit, etc., remains static.

And the reverse: it can get priority to be killed :)

Very nice, and the "bonus" there is what the task can safely use in
comparison to any other task competing for the same resources without
getting selected itself because of that memory.