The current oom_score_adj is completely broken because it is tightly bound to Google's use case and ignores all others.

1) Priority inversion

As Kamezawa-san pointed out, this breaks cgroup and lxc environments. He said:

> Assume 2 processes A, B which have oom_score_adj of 300 and 0
> And A uses 200M, B uses 1G of memory under 4G system
>
> Under the system.
> A's score = (200M * 1000)/4G + 300 = 350
> B's score = (1G * 1000)/4G = 250.
>
> In the cpuset, it has 2G of memory.
> A's score = (200M * 1000)/2G + 300 = 400
> B's score = (1G * 1000)/2G = 500
>
> This priority inversion doesn't happen in the current system.

2) Ratio-based points don't work on large machines

oom_score_adj normalizes the oom score to the 0-1000 range, but if the machine has 1TB of memory, 1 point (i.e. 0.1%) means 1GB. That is not suitable as a tuning parameter. As I said, a tuning parameter oriented around proportional values has a scalability risk.

3) No reason to implement ABI breakage

The old tuning parameter means:
	oom-score = oom-base-score x 2^oom_adj

The new tuning parameter means:
	oom-score = oom-base-score + oom_score_adj / (totalram + totalswap)

But "oom_score_adj / (totalram + totalswap)" can be calculated in userland too, because both totalram and totalswap are already exported by /proc. So there is no reason to introduce a funny new equation.

4) totalram-based normalization assumes a flat memory model

For example, the machine may be asymmetric NUMA. Fat node memory and thin node memory might have different weight values. In other words, totalram-based priority is just one possible policy. A fixed, workload-dependent policy probably shouldn't be embedded in the kernel.

So this patch removes the *UGLY* total_pages suck completely. Googlers can calculate it in userland!

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/base.c | 33 ++---------
 include/linux/oom.h | 16 +-----
 include/linux/sched.h | 2 +-
 mm/oom_kill.c | 142 ...
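The inversion in Kamezawa-san's example can be reproduced with a few lines of Python. This is only a sketch of the simplified formula quoted above, not the kernel's badness() code; the function names are hypothetical, and sizes are in decimal megabytes (4G = 4000M) so the rounded numbers match the example.

```python
# Sketch of the simplified scoring used in the example above.
# oom_score() and kill_order() are hypothetical names, not kernel functions.

def oom_score(usage_mb, available_mb, oom_score_adj):
    """Proportion of available memory (0-1000) plus the userspace adjustment."""
    return usage_mb * 1000 // available_mb + oom_score_adj


def kill_order(tasks, available_mb):
    """Return task names sorted from most to least likely to be killed."""
    scores = {
        name: oom_score(usage_mb, available_mb, adj)
        for name, (usage_mb, adj) in tasks.items()
    }
    return sorted(scores, key=scores.get, reverse=True), scores


# name -> (memory usage in MB, oom_score_adj)
TASKS = {"A": (200, 300), "B": (1000, 0)}

system_order, system_scores = kill_order(TASKS, 4000)   # 4G system
cpuset_order, cpuset_scores = kill_order(TASKS, 2000)   # 2G cpuset
```

With these inputs A ranks first system-wide and B ranks first inside the 2G cpuset, which is the change in ordering the example describes.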
oom_adj is not only used as a kernel knob, it is also used as an application interface. Adding a new knob is therefore not a good reason to deprecate it.

Also, after the former patch, oom_score_adj can't be used for setting OOM_DISABLE. We need the "echo -17 > /proc/<pid>/oom_adj" thing.

This reverts commit 51b1bd2ace1595b72956224deda349efa880b693.
---
 Documentation/feature-removal-schedule.txt |   25 -------------------------
 Documentation/filesystems/proc.txt         |    3 ---
 fs/proc/base.c                             |    8 --------
 include/linux/oom.h                        |    3 ---
 4 files changed, 0 insertions(+), 39 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 842aa9d..aff4d11 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -151,31 +151,6 @@ Who:	Eric Biederman <ebiederm@xmission.com>
 
 ---------------------------
 
-What:	/proc/<pid>/oom_adj
-When:	August 2012
-Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
-	badness heuristic used to determine which task to kill when the kernel
-	is out of memory.
-
-	The badness heuristic has since been rewritten since the introduction of
-	this tunable such that its meaning is deprecated. The value was
-	implemented as a bitshift on a score generated by the badness()
-	function that did not have any precise units of measure. With the
-	rewrite, the score is given as a proportion of available memory to the
-	task allocating pages, so using a bitshift which grows the score
-	exponentially is, thus, impossible to tune with fine granularity.
-
-	A much more powerful interface, /proc/<pid>/oom_score_adj, was
-	introduced with the oom killer rewrite that allows users to increase or
-	decrease the badness() score linearly. This interface will replace
-	/proc/<pid>/oom_adj.
-
-	A warning will be emitted to the kernel log if an application uses this
-	deprecated interface. After ...
Since I nacked the parent patch of this, I implicitly nack this one as well since oom_score_adj shouldn't be going anywhere. The way to disable oom killing for a task via the new interface, /proc/pid/oom_score_adj, is by OOM_SCORE_ADJ_MIN as specified in the documentation. --
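For readers following along, disabling oom killing through the new interface amounts to writing OOM_SCORE_ADJ_MIN (-1000) to /proc/<pid>/oom_score_adj. The Python below is a minimal sketch of that write, not a recommended tool: the function name and the proc_root parameter are hypothetical (the latter exists only so the sketch can be pointed at a scratch directory), and lowering the value on a real system generally requires sufficient privilege.

```python
import os

# Range documented for /proc/<pid>/oom_score_adj.
OOM_SCORE_ADJ_MIN = -1000   # disables oom killing for the task
OOM_SCORE_ADJ_MAX = 1000


def set_oom_score_adj(pid, value, proc_root="/proc"):
    """Write an oom_score_adj value for pid; returns the path written.

    proc_root is a hypothetical parameter for this sketch so that it can
    be exercised against a directory other than the real /proc.
    """
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError("oom_score_adj must be in [-1000, 1000]")
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as f:
        f.write(f"{value}\n")
    return path
```

Calling set_oom_score_adj(pid, OOM_SCORE_ADJ_MIN) is the oom_score_adj counterpart of the "echo -17 > /proc/<pid>/oom_adj" usage mentioned in the patch description.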
That's wrong, we don't even use this heuristic yet and there is nothing,

You continually bring this up, and I've answered it three times, but you've never responded to it before and completely ignore it. I really hope and expect that you'll participate more in the development process and not continue to reiterate your talking points when you have no answer to my response.

You're wrong, especially with regard to cpusets, which was formerly part of the heuristic itself.

Users bind an aggregate of tasks to a cgroup (cpusets or memcg) as a means of isolation and attach a set of resources (memory, in this case) for those tasks to use. The user who does this is fully aware of the set of tasks being bound, there is no mystery or unexpected results when doing so. So when you set an oom_score_adj for a task, you don't necessarily need to be aware of the set of resources it has available, which is dynamic and an attribute of the system or cgroup, but rather the priority of that task in competition with other tasks for the same resources.

_That_ is what is important in having a userspace influence on a badness heuristic: how those badness scores compare relative to other tasks that share the same resources. That's how a task is chosen for oom kill, not because of a static formula such as you're introducing here that outputs a value (and, thus, a priority) regardless of the context in which the task is bound.

That also means that the same task is not necessarily killed in a cpuset-constrained oom compared to a system-wide oom. If you bias a task by 30% of available memory, which Kame did in his example above, it's entirely plausible that task A should be killed because its actual usage is only 1/20th of the machine. When its cpuset is oom, and the admin has specifically bound that task to only 2G of memory, we'd naturally want to kill the memory hogger, that is using 50% of the total memory available to

So you'd rather use the range of oom_adj ...
On Wed, 25 Aug 2010 03:25:25 -0700 (PDT)

I'm now trying to write a userspace tool to calculate this for myself. Then, could you update the documentation?

==
3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------

This file can be used to check the current score used by the oom-killer is for any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which process should be killed in an out-of-memory situation.
==

Add some documentation like:
==
(For system monitoring tool developers, not for ordinary users.)
The oom_score calculation is implementation dependent and can be modified without notice. But the current logic is

	oom_score = ((proc's rss + proc's swap) / (available ram + swap)) + oom_score_adj

proc's rss and swap can be obtained from /proc/<pid>/statm, and available ram + swap depends on the situation.

If the system is totally under oom,
	available ram == /proc/meminfo's MemTotal
	available swap == in most cases == /proc/meminfo's SwapTotal
When you use memory cgroup,
	When swap is limited, available ram + swap == memory cgroup's memsw limit.
	When swap is unlimited, available ram + swap == memory cgroup's memory limit + SwapTotal

Then, please be careful that the order of oom_score among tasks depends on the situation.

Assume 2 processes A, B which have oom_score_adj of 300 and 0
And A uses 200M, B uses 1G of memory under 4G system

Under the 4G system.
	A's score = (200M * 1000)/4G + 300 = 350
	B's score = (1G * 1000)/4G = 250.

In the memory cgroup, it has 2G of resource.
	A's score = (200M * 1000)/2G + 300 = 400
	B's score = (1G * 1000)/2G = 500

You shouldn't depend on /proc/<pid>/oom_score if you have to handle OOM under cgroups and cpuset. But the logic is simple.
==

If you don't want this, I'll add text and a sample tool to cgroup/memory.txt.

Thanks,
-Kame
--
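As a rough illustration of the "available ram + swap" cases listed in the proposed text, a userspace tool might start from something like the Python sketch below. It is not Kame's tool: the function names and the memcg_limit_kb / memcg_memsw_limit_kb parameters are hypothetical, the memcg limits are assumed to have been read elsewhere and converted to kB, and only the /proc/meminfo "Name: value kB" layout is relied upon.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int}; sizes are in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip lines that are not "Name: value [unit]"
        fields[name.strip()] = int(parts[0])
    return fields


def available_kb(meminfo, memcg_limit_kb=None, memcg_memsw_limit_kb=None):
    """Return 'available ram + swap' in kB for the cases listed above.

    - memsw limit given (swap limited): the memsw limit itself
    - only a memory limit given (swap unlimited): limit + SwapTotal
    - neither given (system-wide oom): MemTotal + SwapTotal
    """
    if memcg_memsw_limit_kb is not None:
        return memcg_memsw_limit_kb
    if memcg_limit_kb is not None:
        return memcg_limit_kb + meminfo["SwapTotal"]
    return meminfo["MemTotal"] + meminfo["SwapTotal"]
```

On a live system the first argument would come from reading /proc/meminfo, for example parse_meminfo(open("/proc/meminfo").read()).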
You'll want to look at section 3.1 of Documentation/filesystems/proc.txt,

I'd hesitate to state the formula outside of the implementation and instead focus on the semantics of oom_score_adj (as a proportion of available memory compared to other tasks), which I tried doing in section 3.1. Then, the userspace tool only need be concerned about the units of oom_score_adj rather than whether rss, swap, or later extensions such as shm are added.

Thanks for working on this, Kame!
--
On Wed, 25 Aug 2010 17:52:06 -0700 (PDT)

BTW, why don't you subtract the amount of hugepages? The old code used "totalrampages - hugepage" as available memory.

IIUC, hugepages are not accounted in mm->rss, so isn't it better to subtract the number of hugepages? Hmm... does it make no difference?

Thanks,
-Kame
--
On Wed, 25 Aug 2010 17:52:06 -0700 (PDT)

Hmm. I'll add text like the following to cgroup/memory.txt. O.K.?

==
Notes on oom_score and oom_score_adj.

oom_score is calculated as
	oom_score = (task's proportion of memory) + oom_score_adj.

Then, when you use oom_score_adj to control the order of priority of oom, you should know about the amount of memory you can use. So, an approximate oom_score under memcg can be

	memcg_oom_score = (oom_score - oom_score_adj) * system_memory/memcg's limit + oom_score_adj.

And yes, this can be affected by hierarchy control of memcg, and the calculation will be more complicated. See also the oom_disable feature.
==

Thanks,
-Kame
--
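The conversion in the proposed note can be written out directly. The following Python is a sketch of that one formula only, under the note's own assumptions (a flat memcg with a static limit); the function and parameter names are hypothetical, and it ignores hierarchy as well as any rounding or clamping the kernel may apply to the reported score.

```python
def memcg_oom_score(oom_score, oom_score_adj, system_memory, memcg_limit):
    """Approximate a task's score inside a memcg from its system-wide score.

    Follows the note above: strip the adjustment, rescale the memory
    proportion from the system size to the memcg limit, then add the
    adjustment back. system_memory and memcg_limit must share a unit.
    """
    proportion = oom_score - oom_score_adj
    return proportion * system_memory // memcg_limit + oom_score_adj


# The A/B example from earlier in the thread: 4G system, 2G memcg (in MB).
score_a = memcg_oom_score(350, 300, 4000, 2000)
score_b = memcg_oom_score(250, 0, 4000, 2000)
```

For the earlier example this gives 400 for A and 500 for B, matching the numbers computed by hand for the 2G case.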
I'd replace "memory" with "memory limit (or memsw limit)" so it's clear

Hmm, you need to know the amount of memory that you can use iff you know the memcg limit and it's a static value. Otherwise, you only need to know the "memory usage of your application relative to others in the same cgroup." An oom_score_adj of +300 adds 30% of that memcg's limit to the task, allowing all other tasks to use 30% more memory than that task with it still being killed. An oom_score_adj of -300 allows that task to use 30% more memory than other tasks without getting killed. These don't need to

Right, that's the exact score within the memcg.

But, I still wouldn't encourage a formula like this because the memcg limit (or cpuset mems, mempolicy nodes, etc) are dynamic and may change out from under us. So it's more important to define oom_score_adj in the user's mind as a proportion of memory available to be added (either positively or negatively) to its memory use when comparing it to other tasks. The point is that the memcg limit isn't interesting in this formula, it's more important to understand the priority of the task _compared_ to other tasks' memory usage in that memcg.

It probably would be helpful, though, if you know that a vital system task uses 1G, for instance, in a 4G memcg that an oom_score_adj of -250 will disable oom killing for it. If that task leaks memory or becomes significantly large, for whatever reason, it could be killed, but we _can_ discount the 1G in comparison to other tasks as the "cost of doing business" when it comes to vital system tasks:

	(memory usage) * (memory+swap limit / system memory)
--
On Wed, 25 Aug 2010 19:50:22 -0700 (PDT)

Yes. For defining and understanding priority, oom_score_adj is that.

Yes. Under an 8G system, -250 will allow ignoring 2G of usage.

== How about this text? ==
When you set a task's oom_score_adj, it can get priority not to be oom-killed. oom_score_adj gives priority proportional to the memory limitation.

Assume you set oom_score_adj to -250.

Under a 4G memory limit, it gets a 25% bonus...a 1G memory bonus for avoiding OOM.
Under an 8G memory limit, it gets a 25% bonus...a 2G memory bonus for avoiding OOM.

Then, what bonus a task can get depends on the context of the OOM. If you use oom_score_adj and want to give a bonus to a task, setting it with regard to the minimum memory limitation that the task is under will work well.
==

Thanks,
-Kame
--
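The "bonus" arithmetic in the proposed text is simple enough to check mechanically. The Python below is a small sketch of it, assuming each oom_score_adj point is worth 1/1000 of the limit that is in effect when the oom happens; the function name is hypothetical and the limits are taken in bytes.

```python
GIB = 1024 ** 3


def oom_bonus_bytes(oom_score_adj, limit_bytes):
    """Memory a task may use 'for free' given a negative oom_score_adj.

    Each point is treated as 1/1000 of the limit in effect for the oom
    (system RAM, memcg limit, ...). A positive adjustment yields a
    negative bonus, i.e. a penalty.
    """
    return (-oom_score_adj * limit_bytes) // 1000


bonus_4g = oom_bonus_bytes(-250, 4 * GIB)   # 25% of a 4G limit
bonus_8g = oom_bonus_bytes(-250, 8 * GIB)   # 25% of an 8G limit
```

The same -250 setting is worth 1G under the 4G limit and 2G under the 8G limit, which is why the text suggests choosing the value against the smallest limit the task can end up under.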
I understand it's a little tricky when dealing with memcg-constrained oom conditions versus system-wide oom conditions. Think of it this way: if the system is oom, then every memcg, every cpuset, and every mempolicy is also oom. That doesn't imply that something in every memcg, every cpuset, or every mempolicy must be killed, however. What cgroup happens to be penalized in this scenario isn't necessarily the scope of oom_score_adj's purpose. oom_score_adj certainly does have a stronger influence over a task's priority when it's a system oom and not a memcg oom because the size of available memory is different, but that's fine: we set positive and negative oom_score_adj values for a reason based on the application, and that's not necessarily (but can be) a function of the memcg or system capacity.

Again, oom_score_adj is only meaningful when considered relative to other candidate tasks since the badness score itself is considered relative to other candidate tasks.

You can have multiple tasks that have +1000 oom_score_adj values (or multiple tasks that have +15 oom_adj values). Only one will be killed and it's dependent only on the ordering of the tasklist. That isn't an

If the memcg limit changes because we're attaching more tasks, yes, we may want to change its oom_score_adj relative to those tasks. So oom_score_adj is a function of the attached tasks and its allowed set of

Yeah, that conversion could be useful if the system RAM capacity or memcg

Very nice, and the "bonus" there is what the task can safely use in comparison to any other task competing for the same resources without getting selected itself because of that memory.
--
Please show us evidence. Big talk is not a good way to persuade us.

Yes, I ignored it. Don't tell us your dreams. I want to see a concrete use case. As I have said repeatedly, I won't care about you while you ignore real-world end users. NOBODY SHOULD EXPECT STABILIZATION DEVELOPERS TO BE KIND TO CHANGES THAT HARM END USERS. WE HAVE NO MERCY WHILE YOU CONTINUE IMMORAL DEVELOPMENT.

I'll wait one more day. Pray that someone joins this discussion and explains a real use case in your place.

We don't ignore end users. But nobody except you has responded to this even though

I don't care your , I definitely

I agree your implementation works fine if admins have the same policy

No. As I said,
 - If you want to solve a minority issue, you have to introduce no regression for the majority of users.
 - If you want to solve a major issue and make a big change, investigate worldwide use cases carefully and reflect them.

It was pointed out that oom_score_adj overlooks a lot of use cases. then I

Your equation can be rewritten as

	(rss + swap) + oom_score_adj x (available ram + swap)
	-----------------------------------------------------
	               (available ram + swap)

I already explained the asymmetric NUMA issue in the past. Again, don't assume

Sorry, I don't care about this. Please fix it on your side.

Thanks.
--
We are certainly looking forward to using this when 2.6.36 is released

I'm not ignoring any user with this change, oom_score_adj is an extremely powerful interface for users who want to use it. I'm sorry that it's not as simple to use as you may like.

Basically, it comes down to this: few users actually tune their oom killing priority, period. That's partly because they accept the oom killer's heuristics to kill a memory-hogging task or use panic_on_oom, or because the old interface, /proc/pid/oom_adj, had no unit and no logical way of using it other than polarizing it (either +15 or -17).

For those users who do change their oom killing priority, few are using cpusets or memcg. Yes, the priority changes depending on the context of the oom, but for users who don't use these cgroups the oom_score_adj unit is static since the amount of system memory (the only oom constraint) is static.

Now, for the users of both oom_score_adj and cpusets or memcg (in the future this will include Google), these users are interested in oom killing priority relative to other tasks attached to the same set of resources. For our particular use case, we attach an aggregate of tasks to a cgroup and have a preference on the order in which those tasks are killed whenever that cgroup limit is exhausted. We also care about protecting vital system tasks so that they aren't targeted before others are killed, such as job schedulers.

I think the key point you're missing in our use case is that we don't necessarily care about the system-wide oom condition when we're running with cpusets or memcg. We can protect tasks with negative oom_score_adj, but we don't care about close tiebreakers on which cpuset or memcg is penalized when the entire system is out of memory. If that's the case, each cpuset and memcg is also, by definition, out of memory, so they are all subject to the oom killer. This is equivalent to having several tasks with an oom_score_adj of +1000 (or oom_adj of ...
Of course, there is simply zero justification. Who asked about Google's usage? Every developer has their own debugging and machine administration patches. Don't you wonder why we don't push them upstream? only one usage is

Could you please be serious? We are not making a sand castle, we are making a kernel. You have to understand the difference between them. Zero user feature

Unrelated. We already have the oom notifier. we have no reason to add new knob

The memcg limit is already exposed via /cgroup/memory.limit_in_bytes. It's clearly userland's role.

The fact is, my patch is more powerful than yours because your patch has a fixed oom management policy, but mine doesn't. It can be customized to fit the customer.

More importantly, we already have the oom notifier, and it is the most powerful infrastructure. Any oom policy can be constructed freely with it. It is not restricted by the kernel implementation.

The fact is, your new interface doesn't match HPC, server, banking systems and embedded, AFAIK. At least I couldn't find such a use case in my job experience in those areas. Also, I'm now a journalist, so I have some connections with Linux user groups, but I didn't get positive feedback on your interface. Instead I got some negative feedback.

Of course, I don't know the whole world, so I have repeatedly asked you for a real-world use case. But

Unrelated. You are still talking about your policy. Why do we need to care about it?

Well, this is clearly a bug. The behavior of oom_adj was changed, and it was deprecated by mistake, so the latest kernel prints pointless warnings at every boot.

That said, be serious! Otherwise GO AWAY.
--
/proc/pid/oom_adj was introduced before cpusets or memcg, so we were dealing with a static amount of system resources that would not change out from underneath an application. Although that's not a defense of using a bitshift on a heuristic that included many arbitrary selections, it was reasonable to make it only a scalar that didn't consider the amount of resources that an application was allowed to access.

As time moved on, cgroups such as cpusets and memcg were introduced (and mempolicies became much more popular as larger NUMA machines became more popular in the industry) that bound or restricted the amount of memory that an aggregate of tasks could access. Each of those methods may change the amount of memory resources that an application has available to it at any time without knowledge of that application, job scheduler, or system daemon.

Therefore, it's important to define a more powerful oom_adj mechanism that can properly attribute the oom killing priority of a task in comparison to others that are bound to the same resources without causing a regression on users who don't use those mechanisms that allow

With respect to memory isolation and the oom killer, we want to follow the upstream behavior as much as possible. We are looking forward to this interface being available in 2.6.36 and will begin to actively use it once it is released.

So although there is no existing user today, there will be when 2.6.36 is released. I also hope that other users of cpusets or memcg will find it helpful to define oom killing priority with a unit rather than a bitshift and that understands the dynamic nature of resource

There is no generic oom notifier solution other than what is implemented in the memory controller, and that comes at a cost of roughly 1% of system RAM since that's the amount of metadata that the memcg requires. oom_score_adj works for cpusets, memcg, and mempolicies.

Now you may insist that the fact that very few users actually ...
