Hello,
Could you please review this patch?
The idea behind it is quite simple: give the dying task a higher priority
so that it can be scheduled sooner and die to free memory.
oom-kill: give the dying task a higher priority
In a system under heavy load it was observed that even after the
oom-killer selects a task to die, the task may take a long time to die.
Right before sending a SIGKILL to the selected task the oom-killer
increases the task priority so that it can exit quickly, freeing memory.
That is accomplished by:
/*
* We give our sacrificial lamb high priority and access to
* all the memory it needs. That way it should be able to
* exit() and clear out its resources quickly...
*/
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
It sounds plausible giving the dying task an even higher priority to be
sure it will be scheduled sooner and free the desired memory.
Signed-off-by: Luis Claudio R. Gonçalves <lclaudio@uudg.org>
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b68e802..8047309 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -382,6 +382,8 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
*/
static void __oom_kill_task(struct task_struct *p, int verbose)
{
+ struct sched_param param;
+
if (is_global_init(p)) {
WARN_ON(1);
printk(KERN_WARNING "tried to kill init!\n");
@@ -413,6 +415,8 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
*/
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
+ param.sched_priority = MAX_RT_PRIO-1;
+ sched_setscheduler(p, SCHED_FIFO, &param);
force_sig(SIGKILL, p);
}
Thanks,
Luis
--
[ Luis Claudio R. Goncalves Bass - Gospel - RT ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9 2696 7203 D980 A448 C8F8 ]
--
As usual, I can't really comment the changes in oom logic, just minor
nits...

Probably sched_setscheduler_nocheck() makes more sense.

Minor, but perhaps it would be a bit better to send SIGKILL first,
then raise its prio.

Oleg.
--
I have no objection either, but I don't think the point Oleg raised is minor.
Please send an updated patch.

Thanks.
--
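The ordering Oleg asks for can be illustrated with a userspace analogy. This is only a sketch, not the kernel patch: it uses kill(2) and sched_setscheduler(2) in place of force_sig() and sched_setscheduler_nocheck(), and the helper names kill_then_boost() and demo_kill_child() are hypothetical. It sends SIGKILL first and treats the priority boost as a best-effort second step.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical userspace analogue of the suggested ordering:
 * send SIGKILL first, then try to raise the victim's priority.
 * Returns 0 if the signal was sent, -1 otherwise.
 */
static int kill_then_boost(pid_t pid, int rt_priority)
{
	struct sched_param param = { .sched_priority = rt_priority };

	/* Userspace SCHED_FIFO priorities on Linux range from 1 to 99. */
	assert(rt_priority >= 1 && rt_priority <= 99);

	if (kill(pid, SIGKILL) != 0)
		return -1;

	/*
	 * Best effort: without CAP_SYS_NICE this fails with EPERM. That
	 * permission check is what sched_setscheduler_nocheck() skips
	 * when the call is made from inside the kernel.
	 */
	(void)sched_setscheduler(pid, SCHED_FIFO, &param);
	return 0;
}

/* Fork a child that only sleeps, kill it, and report the fatal signal. */
static int demo_kill_child(void)
{
	int status = 0;
	pid_t child = fork();

	if (child < 0)
		return -1;
	if (child == 0) {
		for (;;)
			pause();
	}
	if (kill_then_boost(child, 1) != 0)
		return -1;
	if (waitpid(child, &status, 0) != child)
		return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Calling demo_kill_child() returns the signal number that terminated the child. The boost itself may silently fail for an unprivileged caller, which is the reason the kernel-side variant without the permission check was suggested.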
On Fri, May 28, 2010 at 11:54:07AM +0900, KOSAKI Motohiro wrote:
| Hi Luis,
|
| > On 05/27, Luis Claudio R. Goncalves wrote:
| > >
| > > It sounds plausible giving the dying task an even higher priority to be
| > > sure it will be scheduled sooner and free the desired memory.
| >
| > As usual, I can't really comment the changes in oom logic, just minor
| > nits...
| >
| > > @@ -413,6 +415,8 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
| > > */
| > > p->rt.time_slice = HZ;
| > > set_tsk_thread_flag(p, TIF_MEMDIE);
| > > + param.sched_priority = MAX_RT_PRIO-1;
| > > + sched_setscheduler(p, SCHED_FIFO, &param);
| > >
| > > force_sig(SIGKILL, p);
| >
| > Probably sched_setscheduler_nocheck() makes more sense.
| >
| > Minor, but perhaps it would be a bit better to send SIGKILL first,
| > then raise its prio.
|
| I have no objection too. but I don't think Oleg's pointed thing is minor.
| Please send updated patch.
|
| Thanks.
This version of the patch addresses the suggestions from Oleg Nesterov and
Kosaki Motohiro.
Thanks again for reviewing the patch.
oom-kill: give the dying task a higher priority (v2)
In a system under heavy load it was observed that even after the
oom-killer selects a task to die, the task may take a long time to die.
Right before sending a SIGKILL to the task selected by the oom-killer
this task has its priority increased so that it can exit() soon,
freeing memory. That is accomplished by:
/*
* We give our sacrificial lamb high priority and access to
* all the memory it needs. That way it should be able to
* exit() and clear out its resources quickly...
*/
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
It sounds plausible giving the dying task an even higher priority to be
sure it will be scheduled sooner and free the desired memory. Oleg Nesterov
pointed out it would be interesting sending the signal before increasing
the task ...
--
I would like to understand the visible benefits of this patch. Have you
seen an OOM-killed task really get bogged down? Should this task really
be competing with other important tasks for run time?
--
Three Cheers,
Balbir
--
What do you mean by important? Until the OOM victim task exits completely,
the system has no memory, and none of the important tasks can do anything.

In almost all kernel subsystems, an automatic priority boost is a really bad
idea because it may break the deterministic behavior of RT tasks. But OOM is
one of the exceptions: determinism was already broken by memory starvation.

That's the reason I acked it.
--
Hi, Kosaki.

On Fri, May 28, 2010 at 1:46 PM, KOSAKI Motohiro

Yes or no. IMHO, RT tasks normally shouldn't use dynamic allocation (i.e.,
non-deterministic functions or system calls) in places where determinism is
needed. So memory starvation might not break real-time
--
Kind regards,
Minchan Kim
--
I think it's impossible. Normally an RT task uses mlock, which prevents almost
all page allocation. But every syscall internally calls kmalloc(), so they
can't avoid it in practice.

How do you perfectly avoid dynamic allocation?
--
On Fri, May 28, 2010 at 2:39 PM, KOSAKI Motohiro
RT Task
void non-RT-function()
{
system call();
buffer = malloc();
memset(buffer);
}
/*
* We make sure this function must be executed in some millisecond
*/
void RT-function()
{
some calculation(); <- This doesn't have no dynamic characteristic
}
int main()
{
non-RT-function();
/* This function make sure RT-function cannot preempt by others */
set_RT_max_high_priority();
RT-function A();
set_normal_priority();
non-RT-function();
}
We don't want realtime for the whole task. What we want is
just RT-function A.
Of course, current Linux cannot make perfectly sure that RT-function A
cannot be preempted by others,
because some interrupt or exception may happen. But RT-function A
isn't related to any dynamic characteristic. What can justify
preempting RT-function A in favor of other processes?
--
Kind regards,
Minchan Kim
--
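The pseudocode in the message above can be turned into a small userspace program. This is a minimal sketch under assumptions: the function bodies are placeholders, the helper names (set_policy, run_task) are hypothetical, and the priority change is best effort because SCHED_FIFO normally needs CAP_SYS_NICE. It only shows the shape of the design being discussed: allocation and system calls stay outside the boosted section, and the boosted section is pure calculation.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <string.h>

/* Non-RT part: free to allocate, touch memory and make system calls. */
static size_t non_rt_function(size_t len)
{
	char *buffer = malloc(len);

	if (buffer == NULL)
		return 0;
	memset(buffer, 0xA5, len);
	free(buffer);
	return len;
}

/* RT part: pure calculation, no allocation and no system calls. */
static long rt_function(int n)
{
	long sum = 0;

	for (int i = 1; i <= n; i++)
		sum += (long)i * i;
	return sum;
}

/* Switch the calling thread to the given policy and priority. */
static int set_policy(int policy, int priority)
{
	struct sched_param param = { .sched_priority = priority };

	return sched_setscheduler(0, policy, &param);
}

static long run_task(int n)
{
	long result;
	int boosted;

	(void)non_rt_function(4096);

	/* Boost only around the deterministic section (may fail with EPERM). */
	boosted = set_policy(SCHED_FIFO, sched_get_priority_max(SCHED_FIFO)) == 0;
	result = rt_function(n);
	if (boosted)
		(void)set_policy(SCHED_OTHER, 0);

	(void)non_rt_function(4096);
	return result;
}
```

For example, run_task(4) computes 1 + 4 + 9 + 16 whether or not the boost succeeded. The disagreement in the thread is whether a section like rt_function() may be preempted by an OOM victim that has been given RT priority.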
As far as I have observed, an RT-function always makes some syscalls, because
pure calculation doesn't need a deterministic guarantee. But _if_ you are
really using such a priority design, I'm OK with the maximum non-RT priority
instead of the maximum RT priority too.

Luis, does a high non-RT priority break your use case? If yes, can you please
explain the reason?
--
On Fri, May 28, 2010 at 2:59 PM, KOSAKI Motohiro
Hmm. It's just an example, but maybe it was not a good example.
Let's change it to this.
void RT-function()
{
int result = some calculation(); <- This doesn't have no dynamic
characteristic
*mmap_base = result; <-- mmap_base is mapped by GPIO device.
}
Could we allow this RT function to be preempted because of another task's
memory pressure?
Of course, Linux is not a hard-RT OS, I think. So I think it
is a policy problem.
If we think system memory pressure is more important than the RT task and
we _all_ agree on such a policy, we can allow it.
But I hope we don't.
--
Kind regards,
Minchan Kim
--
On Fri, May 28, 2010 at 02:59:02PM +0900, KOSAKI Motohiro wrote:
| > RT Task
| >
| > void non-RT-function()
| > {
| > system call();
| > buffer = malloc();
| > memset(buffer);
| > }
| > /*
| > * We make sure this function must be executed in some millisecond
| > */
| > void RT-function()
| > {
| > some calculation(); <- This doesn't have no dynamic characteristic
| > }
| > int main()
| > {
| > non-RT-function();
| > /* This function make sure RT-function cannot preempt by others */
| > set_RT_max_high_priority();
| > RT-function A();
| > set_normal_priority();
| > non-RT-function();
| > }
| >
| > We don't want realtime in whole function of the task. What we want is
| > just RT-function A.
| > Of course, current Linux cannot make perfectly sure RT-functionA can
| > not preempt by others.
| > That's because some interrupt or exception happen. But RT-function A
| > doesn't related to any dynamic characteristic. What can justify to
| > preempt RT-function A by other processes?
|
| As far as my observation, RT-function always have some syscall. because pure
| calculation doesn't need deterministic guarantee. But _if_ you are really
| using such priority design. I'm ok maximum NonRT priority instead maximum
| RT priority too.
I confess I failed to distinguish memcg OOM and system OOM and used "in
case of OOM kill the selected task the faster you can" as the guideline.
If the exit code path is short that shouldn't be a problem.
Maybe the right way to go would be giving the dying task the biggest
priority inside that memcg to be sure that it will be the next process from
that memcg to be scheduled. Would that be reasonable?
| Luis, NonRT high priority break your use case? and if yes, can you please
| explain the reason?
Most of my tests are in the realtime land, usually with preempt_rt kernels.
In this case, an RT priority will be usually necessary. But that is not the
general case and I agree that a smoother (but not slower) ...
--
Hmm. I can't understand your point.
What do you mean by failing to distinguish memcg and system OOM?

We already distinguish them by mem_cgroup_out_of_memory
(but we have to enable CONFIG_CGROUP_MEM_RES_CTLR).
So the task selected in select_bad_process is one of the memcg's tasks when
the memcg has memory pressure.
Isn't it enough?
--
Kind regards,
Minchan Kim
--
We have a routine to help figure out if the task belongs to the memory
cgroup that caused the OOM. The OOM entry from a memory cgroup is different
from a regular one.
--
Three Cheers,
Balbir
--
On Fri, May 28, 2010 at 11:06:23PM +0900, Minchan Kim wrote:
| On Fri, May 28, 2010 at 09:53:05AM -0300, Luis Claudio R. Goncalves wrote:
| > On Fri, May 28, 2010 at 02:59:02PM +0900, KOSAKI Motohiro wrote:
...
| > | As far as my observation, RT-function always have some syscall. because pure
| > | calculation doesn't need deterministic guarantee. But _if_ you are really
| > | using such priority design. I'm ok maximum NonRT priority instead maximum
| > | RT priority too.
| >
| > I confess I failed to distinguish memcg OOM and system OOM and used "in
| > case of OOM kill the selected task the faster you can" as the guideline.
| > If the exit code path is short that shouldn't be a problem.
| >
| > Maybe the right way to go would be giving the dying task the biggest
| > priority inside that memcg to be sure that it will be the next process from
| > that memcg to be scheduled. Would that be reasonable?
|
| Hmm. I can't understand your point.
| What do you mean failing distinguish memcg and system OOM?
|
| We already have been distinguish it by mem_cgroup_out_of_memory.
| (but we have to enable CONFIG_CGROUP_MEM_RES_CTLR).
| So task selected in select_bad_process is one out of memcg's tasks when
| memcg have a memory pressure.
The approach of giving the highest priority to the dying task makes sense
in a system wide OOM situation. I thought that would also be good for the
memcg OOM case.
After Balbir Singh's comment, I understand that in a memcg OOM the dying
task should have a priority just above the priority of the main task of
that memcg, in order to avoid interfering in the rest of the system.
That is the point where I failed to distinguish between memcg and system OOM.
Should I pursue that new idea of looking for the right priority inside the
memcg or is it overkill? I really don't have a clear view of the impact of
a memcg OOM on system performance - don't know if it is better to solve the
issue sooner (highest RT priority) or leave it to be solved later ...
I think the highest RT priority isn't a good solution.
As I mentioned, some RT functions don't want to be preempted by other
processes which cause memory pressure. It breaks the RT task.

On the other hand, normal processes don't have RT requirements,
so I think it isn't a big problem if they lose a little of their time slice.

So how about raising it to the max normal priority?
But I am not sure this is the right solution. Let's listen to others' opinions.
--
Kind regards,
Minchan Kim
--
All the patches I've seen use MAX_RT_PRIO-1, which is actually FIFO-1, which is the lowest RT priority. --
Stupid me. I confused that until now. That's exactly what I want. -- Kind regards, Minchan Kim --
On Sat, May 29, 2010 at 12:12:49AM +0900, Minchan Kim wrote:
| On Fri, May 28, 2010 at 11:36:17AM -0300, Luis Claudio R. Goncalves wrote:
| > On Fri, May 28, 2010 at 11:06:23PM +0900, Minchan Kim wrote:
| > | On Fri, May 28, 2010 at 09:53:05AM -0300, Luis Claudio R. Goncalves wrote:
| > | > On Fri, May 28, 2010 at 02:59:02PM +0900, KOSAKI Motohiro wrote:
| > ...
| > | > | As far as my observation, RT-function always have some syscall. because pure
| > | > | calculation doesn't need deterministic guarantee. But _if_ you are really
| > | > | using such priority design. I'm ok maximum NonRT priority instead maximum
| > | > | RT priority too.
| > | >
| > | > I confess I failed to distinguish memcg OOM and system OOM and used "in
| > | > case of OOM kill the selected task the faster you can" as the guideline.
| > | > If the exit code path is short that shouldn't be a problem.
| > | >
| > | > Maybe the right way to go would be giving the dying task the biggest
| > | > priority inside that memcg to be sure that it will be the next process from
| > | > that memcg to be scheduled. Would that be reasonable?
| > |
| > | Hmm. I can't understand your point.
| > | What do you mean failing distinguish memcg and system OOM?
| > |
| > | We already have been distinguish it by mem_cgroup_out_of_memory.
| > | (but we have to enable CONFIG_CGROUP_MEM_RES_CTLR).
| > | So task selected in select_bad_process is one out of memcg's tasks when
| > | memcg have a memory pressure.
| >
| > The approach of giving the highest priority to the dying task makes sense
| > in a system wide OOM situation. I though that would also be good for the
| > memcg OOM case.
| >
| > After Balbir Singh's comment, I understand that in a memcg OOM the dying
| > task should have a priority just above the priority of the main task of
| > that memcg, in order to avoid interfering in the rest of the system.
| >
| > That is the point where I failed to distinguish between memcg and system OOM.
| >
| > Should I pursue ...
What I want to say is that determinism has no relation to OOM.
Why is some RT task affected by another process's OOM?

Of course, if the system has no memory, it is likely to slow down the RT task.
But that's just a thought. If some task that is scheduled is just exiting, we
don't need to raise the OOMed task's priority.

But raising it to the min RT priority, as in your patch, was what I want.
It doesn't preempt any RT task.

So until now, I have made noise about your patch.
Really, sorry for that.
I don't have any objection to the priority-raising part from now on.

Thanks, Luis.
--
Kind regards,
Minchan Kim
--
On Sat, May 29, 2010 at 12:45:49AM +0900, Minchan Kim wrote:
| On Fri, May 28, 2010 at 12:28:42PM -0300, Luis Claudio R. Goncalves wrote:
| > On Sat, May 29, 2010 at 12:12:49AM +0900, Minchan Kim wrote:
...
| > | I think highest RT proirity ins't good solution.
| > | As I mentiond, Some RT functions don't want to be preempted by other processes
| > | which cause memory pressure. It makes RT task broken.
| >
| > For the RT case, if you reached a system OOM situation, your determinism has
| > already been hurt. If the memcg OOM happens on the same memcg your RT task
| > is - what will probably be the case most of time - again, the determinism
| > has deteriorated. For both these cases, giving the dying task SCHED_FIFO
| > MAX_RT_PRIO-1 means a faster recovery.
|
| What I want to say is that determinisic has no relation with OOM.
| Why is some RT task affected by other process's OOM?
|
| Of course, if system has no memory, it is likely to slow down RT task.
| But it's just only thought. If some task scheduled just is exit, we don't need
| to raise OOMed task's priority.
|
| But raising min rt priority on your patch was what I want.
| It doesn't preempt any RT task.
|
| So until now, I have made noise about your patch.
| Really, sorry for that.
| I don't have any objection on raising priority part from now on.
This is the third version of the patch, factoring in your input along with
Peter's comment. Basically the same patch, but using the lowest RT priority
to boost the dying task.
Thanks again for reviewing and commenting.
Luis
oom-killer: give the dying task rt priority (v3)
Give the dying task RT priority so that it can be scheduled quickly and die,
freeing needed memory.
Signed-off-by: Luis Claudio R. Gonçalves <lgoncalv@redhat.com>
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84bbba2..2b0204f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -266,6 +266,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints)
*/
static ...
Almost acceptable to me, but I have two requests:

- use the order 1) force_sig() 2) sched_setscheduler(), as Oleg mentioned
- don't boost priority if it's in mem_cgroup_out_of_memory()

Can you accept this? If not, can you please explain the reason?
--
On Sat, May 29, 2010 at 12:59:09PM +0900, KOSAKI Motohiro wrote:
| Hi
|
| > oom-killer: give the dying task rt priority (v3)
| >
| > Give the dying task RT priority so that it can be scheduled quickly and die,
| > freeing needed memory.
| >
| > Signed-off-by: Luis Claudio R. Gonçalves <lgoncalv@redhat.com>
|
| Almostly acceptable to me. but I have two requests,
|
| - need 1) force_sig() 2)sched_setscheduler() order as Oleg mentioned
| - don't boost priority if it's in mem_cgroup_out_of_memory()
|
| Can you accept this? if not, can you please explain the reason?
|
| Thanks.
The last patch I posted was the wrong patch from my queue. Sorry for the
confusion. Here is the last version of the patch, including the suggestions
from Oleg, Peter and Kosaki Motohiro:
oom-kill: give the dying task a higher priority (v4)
In a system under heavy load it was observed that even after the
oom-killer selects a task to die, the task may take a long time to die.
Right before sending a SIGKILL to the task selected by the oom-killer
this task has its priority increased so that it can exit() soon,
freeing memory. That is accomplished by:
/*
* We give our sacrificial lamb high priority and access to
* all the memory it needs. That way it should be able to
* exit() and clear out its resources quickly...
*/
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
It sounds plausible giving the dying task an even higher priority to be
sure it will be scheduled sooner and free the desired memory. It was
suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that this
task won't interfere with any running RT task.
Another good suggestion, implemented here, was to avoid boosting the dying
task priority in case of mem_cgroup OOM.
Signed-off-by: Luis Claudio R. Gonçalves <lclaudio@uudg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ...
--
Hi, Kosaki.

On Sat, May 29, 2010 at 12:59 PM, KOSAKI Motohiro

Why do you not want to boost priority if it's in the memcontrol path?

If it's in the memcontrol path and CONFIG_CGROUP_MEM_RES_CTLR is enabled,
mem_cgroup_out_of_memory will select the victim task in the memcg.
So __oom_kill_task's target task would be in the memcg, I think.

As you and the memcg guys don't complain about this, I must be missing
something. Could you explain it? :)
--
Kind regards,
Minchan Kim
--
Yep.
But a priority boost naturally causes CPU starvation for processes outside
the group. So, my points are:

1) Usually a priority boost is the wrong idea. It has various side effects,
   but system-wide OOM is one of the exceptions. In that case no tasks are
   runnable, so the downside is acceptable.
2) memcg has an OOM notification mechanism. If admins need a priority
   boost, they can do it from their OOM daemon.

Thanks.
--
On Mon, May 31, 2010 at 3:35 PM, KOSAKI Motohiro

Is it possible to kill the hogging task immediately when the daemon sends
the kill signal? I mean we can give the OOM daemon a higher priority than
others and it can send a signal to a normal process. But when does the normal
process exit after receiving the kill signal from the OOM daemon? Maybe when
the killed task is run by the scheduler. It's the same problem again, I think.
--
Kind regards,
Minchan Kim
--
On Mon, 31 May 2010 16:05:48 +0900

This is just an idea and I have no implementation yet.
With memcg, an oom situation can be recovered by "enlarging the limit
temporarily". Then, what the daemon has to do is

1. send a signal (kill, or another signal to abort for a coredump).
2. move the problematic task to a jail if necessary.
3. enlarge the limit to indicate "Go".
4. after stabilization, reduce the limit.

This is the fastest. The admin has to think of extra room or jails, and the
daemon should be clever enough. But in most cases, I think this works well.

Thanks,
-Kame
--
On Mon, May 31, 2010 at 4:25 PM, KAMEZAWA Hiroyuki

I think it is very hard to know how much extra room we have to make, since we
can't predict how many tasks are stuck allocating memory.
But I tend to agree that the system-wide OOM problem is more important than
memcg's one, and the memcg guys don't seem to have any problem with it.
So I am not against this patch any more.
--
Kind regards,
Minchan Kim
--
I still can't understand your point.
Why did you set the priority to "MAX_RT_PRIO - 10"?
Why did you replace sched_setscheduler_nocheck with sched_setscheduler?
It means you can't boost the priority if the current context doesn't have
permission.
--
Kind regards,
Minchan Kim
--
On Fri, 28 May 2010 13:48:26 -0300 BTW, how about the other threads which share mm_struct ? Thanks, -Kame --
Hi, Kame. On Mon, May 31, 2010 at 9:21 AM, KAMEZAWA Hiroyuki -- Kind regards, Minchan Kim --
On Mon, 31 May 2010 14:01:03 +0900

IIUC, the purpose of raising the priority is to accelerate the dying thread's
exit() so that memory is freed ASAP. But to free the memory, all threads which
share the mm_struct have to exit, too. I'm sorry if I miss something.

Thanks,
-Kame
--
On Mon, May 31, 2010 at 2:04 PM, KAMEZAWA Hiroyuki

How do we kill only some threads, and what's the benefit of it?
I think that when some thread receives a KILL signal, the process including
that thread will be killed.
--
Kind regards,
Minchan Kim
--
On Mon, 31 May 2010 14:46:05 +0900

Yes, so if you want a _process_ to die quickly, you have to accelerate all the
threads of the process. Accelerating one thread in a process is not a big help.

Thanks,
-Kame
--
On Mon, May 31, 2010 at 2:54 PM, KAMEZAWA Hiroyuki

Yes.

I see the code.
oom_kill_process is called by

1. mem_cgroup_out_of_memory
2. __out_of_memory
3. out_of_memory

(1,2) call select_bad_process, which selects the victim task among processes
by do_each_process.
But 3 doesn't. In the case of CONSTRAINT_MEMORY_POLICY, it kills current.
--
Kind regards,
Minchan Kim
--
On Mon, 31 May 2010 15:09:41 +0900

Hmm, my point is that the priority acceleration applies to a thread, not to a
process. So most of the threads in the memory-eater will not gain high
priority even with this patch, and will run slowly.

I have no objections to this patch. I just want to confirm the purpose. If
this patch is for accelerating a process exiting by SIGKILL, it seems not
enough.
If an explanation such as "accelerating all threads' priority in a process
seems overkill" is given in the changelog or a comment, it's OK with me.

Thanks,
-Kame
--
On Mon, May 31, 2010 at 3:51 PM, KAMEZAWA Hiroyuki Okay. I got your point. Kame's concern is proper. -- Kind regards, Minchan Kim --
On Mon, May 31, 2010 at 03:51:02PM +0900, KAMEZAWA Hiroyuki wrote:
| On Mon, 31 May 2010 15:09:41 +0900
| Minchan Kim <minchan.kim@gmail.com> wrote:
| > On Mon, May 31, 2010 at 2:54 PM, KAMEZAWA Hiroyuki
| > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
...
| > >> > IIUC, the purpose of rising priority is to accerate dying thread to exit()
| > >> > for freeing memory AFAP. But to free memory, exit, all threads which share
| > >> > mm_struct should exit, too. I'm sorry if I miss something.
| > >>
| > >> How do we kill only some thread and what's the benefit of it?
| > >> I think when if some thread receives KILL signal, the process include
| > >> the thread will be killed.
| > >>
| > > yes, so, if you want a _process_ die quickly, you have to acceralte the whole
| > > threads on a process. Acceralating a thread in a process is not big help.
| >
| > Yes.
| >
| > I see the code.
| > oom_kill_process is called by
| >
| > 1. mem_cgroup_out_of_memory
| > 2. __out_of_memory
| > 3. out_of_memory
| >
| >
| > (1,2) calls select_bad_process which select victim task in processes
| > by do_each_process.
| > But 3 isn't In case of CONSTRAINT_MEMORY_POLICY, it kills current.
| > In only the case, couldn't we pass task of process, not one of thread?
| >
|
| Hmm, my point is that priority-acceralation is against a thread, not against a process.
| So, most of threads in memory-eater will not gain high priority even with this patch
| and works slowly.
This is a good point...
| I have no objections to this patch. I just want to confirm the purpose. If this patch
| is for accelating exiting process by SIGKILL, it seems not enough.
I understand (from the comments in the code) the badness calculation gives
more points to the siblings in a thread that have their own mm. I wonder if
what you are describing is not a corner case.
Again, your idea sounds like an interesting refinement to the patch. I am
just not sure this change should be implemented now or in a second round ...
On Mon, 31 May 2010 10:52:27 -0300 yes, nice catch. Thanks, -Kame --
On Tue, Jun 01, 2010 at 08:50:06AM +0900, KAMEZAWA Hiroyuki wrote:
| On Mon, 31 May 2010 10:52:27 -0300
| "Luis Claudio R. Goncalves" <lclaudio@uudg.org> wrote:
|
| > | If an explanation as "acceralating all thread's priority in a process seems overkill"
| > | is given in changelog or comment, it's ok to me.
| >
| > If my understanding of badness() is right, I wouldn't be ashamed of saying
| > that it seems to be _a bit_ overkill. But I may be wrong in my
| > interpretation.
| >
| > While re-reading the code I noticed that in select_bad_process() we can
| > eventually bump on an already dying task, case in which we just wait for
| > the task to die and avoid killing other tasks. Maybe we could boost the
| > priority of the dying task here too.
| >
| yes, nice catch.
Here is a more complete version of the patch, boosting priority on the
three exit points of the OOM-killer. I also avoid touching the priority if
the task is already an RT task. The patch:
oom-kill: give the dying task a higher priority (v5)
In a system under heavy load it was observed that even after the
oom-killer selects a task to die, the task may take a long time to die.
Right before sending a SIGKILL to the task selected by the oom-killer
this task has its priority increased so that it can exit() soon,
freeing memory. That is accomplished by:
/*
* We give our sacrificial lamb high priority and access to
* all the memory it needs. That way it should be able to
* exit() and clear out its resources quickly...
*/
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
It sounds plausible giving the dying task an even higher priority to be
sure it will be scheduled sooner and free the desired memory. It was
suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that
this task won't interfere with any running RT task.
If the dying task is already an RT task, leave it untouched.
Another good suggestion, implemented here, ...
--
That's unnecessary, if p already has TIF_MEMDIE set, then

This has the potential to actually make it harder to free memory if p is
waiting to acquire a writelock on mm->mmap_sem in the exit path while the
thread holding mm->mmap_sem is trying to run.
--
If p is waiting, changing its prio has no effect. It continues to wait for
mmap_sem to be released.
--
On Wed, Jun 02, 2010 at 10:54:01PM +0900, KOSAKI Motohiro wrote:
| > > @@ -291,9 +309,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
| > > * Otherwise we could get an easy OOM deadlock.
| > > */
| > > if (p->flags & PF_EXITING) {
| > > - if (p != current)
| > > + if (p != current) {
| > > + boost_dying_task_prio(p, mem);
| > > return ERR_PTR(-1UL);
| > > -
| > > + }
| > > chosen = p;
| > > *ppoints = ULONG_MAX;
| > > }
| >
| > This has the potential to actually make it harder to free memory if p is
| > waiting to acquire a writelock on mm->mmap_sem in the exit path while the
| > thread holding mm->mmap_sem is trying to run.
|
| if p is waiting, changing prio have no effect. It continue tol wait to release mmap_sem.
Ok, that was not a good idea after all :)
But I understand the !rt_task(p) test is necessary to avoid decrementing
the priority of an eventual RT task selected to die. Though it may also be
a corner case in badness().
Luis
--
And that can reduce the runtime of the thread holding a writelock on mm->mmap_sem, making the exit actually take longer than without the patch if its priority is significantly higher, especially on smaller machines. --
If p needs mmap_sem, p is going to sleep waiting for mmap_sem. If p doesn't,
exiting quickly is a good thing.

In other words, task fairness is not our goal when an oom occurs.
--
On Thu, Jun 3, 2010 at 8:36 AM, KOSAKI Motohiro
I tend to agree. I didn't agree with boosting the priority of all the threads.
Task fairness vs. system hang is a trade-off: task fairness is best
effort, but a system hang is critical.
Also, we already try to do it:
/*
* We give our sacrificial lamb high priority and access to
* all the memory it needs. That way it should be able to
* exit() and clear out its resources quickly...
*/
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
But I think the above code is meaningless unless p uses SCHED_RR.
So boosting to the lowest RT priority with FIFO is meant to meet the above
comment's goal, I think.
--
Kind regards,
Minchan Kim
--
/me smells an inversion... on -rt we solved those ;-) --
Right, but I don't see how increasing an oom killed tasks priority to a divine priority doesn't impact the priorities of other tasks which may be blocking the exit of that task, namely a coredumper or holder of mm->mmap_sem. This patch also doesn't address how it negatively impacts the priorities of jobs running in different cpusets (although sharing the same cpus) because one cpuset is oom. --
On Mon, May 31, 2010 at 10:52 PM, Luis Claudio R. Goncalves
First of all, I think your patch is first.
That's because I am not sure this logic is effective.
/*
* We give our sacrificial lamb high priority and access to
* all the memory it needs. That way it should be able to
* exit() and clear out its resources quickly...
*/
p->rt.time_slice = HZ;
Peter changed it in fa717060f1ab.
Now, if we set rt.time_slice to HZ, does it mean the task has high priority?
I am not a scheduler expert, but as far as I can see from the scheduler code,
rt.time_slice is only relevant to the RT scheduler, so if the task uses CFS, it
doesn't make the task high priority.
Peter, right?
If that is right, I think Luis's patch will fix it.
Secondly, as Kame pointed out, we have to raise the priority of all the
threads to kill the victim process and reclaim its pages. But I think that
has a deadlock problem.
If we raise the priority of all the threads and some thread depends on
another thread which is blocked, it can deadlock the system. So I think
it's not the easy part.
--
Kind regards,
Minchan Kim
--
Agreed, this has the potential to actually increase the amount of time for an oom killed task to fully exit: the exit path takes mm->mmap_sem on exit and if that is held by another thread waiting for the oom killed task to exit (i.e. reclaim has failed and the oom killer becomes a no-op because it sees an already killed task) then there's a livelock. That's always been a problem, but is compounded with increasing the priority of a task not holding mm->mmap_sem if the thread holding the writelock actually isn't looking for memory but simply doesn't get a chance to release because it fails to run. --
I am still not convinced, especially if we are running under mem cgroup.
Even setting SCHED_FIFO does not help: you could have other things like
cpusets that might restrict the CPUs you can run on, or any other policy,
and we could end up contending anyway with other

If we could show faster recovery from OOM or anything else, I would be
more convinced.
--
Three Cheers,
Balbir
--
Ah, right you are. I had missed mem-cgroup.

But I think memcgroup doesn't need the following two boosts either. Can we
get rid of them?

p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
--
On Fri, 28 May 2010 11:57:01 +0530
Off topic.
1. Run a daemon at the highest RT priority.
2. Disable OOM for a mem cgroup.
3. The daemon registers an oom-event-notifier for the mem cgroup.
When OOM happens:
4. The daemon receives an event, and then,
a) enlarges the limit
or
b) kills a task
or
c) enlarges the limit temporarily and kills a task; later, reduces the limit again.
This is the fastest and most promising operation for memcg users.
memcg's oom slowdown happens just because it's limited by a user configuration,
not by the system. That's a point to be considered.
The oom situation can be fixed up _immediately_ by enlarging the limit as an emergency mode.
If you have to wait for the end of a task, there will be a delay; it's unavoidable.
Thanks,
-Kame
--
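The daemon steps listed above could be sketched roughly as follows. This is not code from the thread. It assumes the cgroup v1 memory controller interface (memory.oom_control, cgroup.event_control and memory.limit_in_bytes, used with eventfd(2)), and the helper names and the cgroup directory passed in are hypothetical. Step 1 (running the daemon itself at RT priority) and the kill variants of step 4 are left out.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Write a short string into <cgroup_dir>/<file>; 0 on success, -1 on error. */
static int write_cgroup_value(const char *cgroup_dir, const char *file,
			      const char *value)
{
	char path[4096];
	ssize_t len = (ssize_t)strlen(value);
	int fd, ok;

	snprintf(path, sizeof(path), "%s/%s", cgroup_dir, file);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ok = (write(fd, value, (size_t)len) == len);
	close(fd);
	return ok ? 0 : -1;
}

/*
 * Steps 2 and 3: disable the in-kernel OOM killer for the group and
 * register for OOM events. Returns an eventfd that becomes readable
 * when the group hits OOM, or -1 on error.
 */
static int register_oom_notifier(const char *cgroup_dir)
{
	char path[4096], line[64];
	int efd, ofd;

	assert(cgroup_dir != NULL);

	/* Step 2: writing "1" disables the OOM killer for this group. */
	if (write_cgroup_value(cgroup_dir, "memory.oom_control", "1") != 0)
		return -1;

	snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
	ofd = open(path, O_RDONLY);
	if (ofd < 0)
		return -1;

	efd = eventfd(0, 0);
	if (efd < 0) {
		close(ofd);
		return -1;
	}

	/* Step 3: register "<eventfd> <oom_control fd>" for notification. */
	snprintf(line, sizeof(line), "%d %d", efd, ofd);
	if (write_cgroup_value(cgroup_dir, "cgroup.event_control", line) != 0) {
		close(efd);
		close(ofd);
		return -1;
	}
	/* ofd is deliberately left open for the lifetime of the daemon. */
	return efd;
}

/* Step 4a: block until an OOM event arrives, then enlarge the limit. */
static int wait_and_enlarge(int efd, const char *cgroup_dir,
			    const char *new_limit)
{
	uint64_t count;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return write_cgroup_value(cgroup_dir, "memory.limit_in_bytes",
				  new_limit);
}
```

A daemon would call register_oom_notifier() once at startup and then loop on wait_and_enlarge(), lowering the limit again after the group has stabilized. Both helpers return -1 when the cgroup directory does not exist, so the sketch fails cleanly on systems without this controller mounted.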
Argh, so you got me confused as well. the sched_param ones are userspace values, so you should be using 1. --
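The confusion over MAX_RT_PRIO-1 comes from two numbering schemes. The sketch below is a simplified model, not kernel code: it restates the constant MAX_RT_PRIO as 100 and uses a hypothetical helper name, rt_internal_prio(), to show how a userspace sched_priority value is assumed to map onto the scheduler's internal priority for an RT task, where a lower internal number means a higher priority.

```c
#include <assert.h>

/* Restated for illustration; the kernel defines these in its own headers. */
#define MAX_USER_RT_PRIO 100
#define MAX_RT_PRIO MAX_USER_RT_PRIO

/*
 * Map a userspace RT priority (sched_param.sched_priority, 1..99) to the
 * internal priority used by the scheduler, where 0 is the most urgent.
 */
static int rt_internal_prio(int rt_priority)
{
	assert(rt_priority >= 1 && rt_priority <= MAX_USER_RT_PRIO - 1);
	return MAX_RT_PRIO - 1 - rt_priority;
}
```

Under this model, sched_priority = MAX_RT_PRIO-1 maps to internal priority 0, the highest RT priority, while sched_priority = 1 maps to 98, the lowest. That is why the value passed to sched_setscheduler() has to be 1 if the intent is "the lowest RT priority".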
