Hi Here is updated series for various OOM fixes. Almost fixes are trivial. One big improvement is Luis's dying task priority boost patch. This is necessary for RT folks. oom: don't try to kill oom_unkillable child oom: oom_kill_process() doesn't select kthread child oom: make oom_unkillable_task() helper function oom: oom_kill_process() need to check p is unkillable oom: /proc/<pid>/oom_score treat kernel thread honestly oom: kill duplicate OOM_DISABLE check oom: move OOM_DISABLE check from oom_kill_task to out_of_memory() oom: cleanup has_intersects_mems_allowed() oom: remove child->mm check from oom_kill_process() oom: give the dying task a higher priority oom: multi threaded process coredump don't make deadlock fs/proc/base.c | 5 ++- mm/oom_kill.c | 100 +++++++++++++++++++++++++++++++++++++++----------------- 2 files changed, 73 insertions(+), 32 deletions(-) --
Now, badness() doesn't care neigher CPUSET nor mempolicy. Then
if the victim child process have disjoint nodemask, OOM Killer might
kill innocent process.
This patch fixes it.
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/oom_kill.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26ae697..0aeacb2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -429,7 +429,7 @@ static int oom_kill_task(struct task_struct *p)
static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
unsigned long points, struct mem_cgroup *mem,
- const char *message)
+ nodemask_t *nodemask, const char *message)
{
struct task_struct *victim = p;
struct task_struct *child;
@@ -469,6 +469,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
continue;
if (mem && !task_in_mem_cgroup(child, mem))
continue;
+ if (!has_intersects_mems_allowed(child, nodemask))
+ continue;
/* badness() returns 0 if the thread is unkillable */
child_points = badness(child, uptime.tv_sec);
@@ -519,7 +521,7 @@ retry:
if (!p || PTR_ERR(p) == -1UL)
goto out;
- if (oom_kill_process(p, gfp_mask, 0, points, mem,
+ if (oom_kill_process(p, gfp_mask, 0, points, mem, NULL,
"Memory cgroup out of memory"))
goto retry;
out:
@@ -678,7 +680,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* non-zero, current could not be killed so we must fallback to
* the tasklist scan.
*/
- if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
+ if (!oom_kill_process(current, gfp_mask, order, 0, NULL, nodemask,
"Out of memory (oom_kill_allocating_task)"))
return;
}
@@ -697,7 +699,7 @@ retry:
panic("Out of memory and no killable processes...\n");
}
- if (oom_kill_process(p, gfp_mask, order, points, NULL,
+ if ...Now, select_bad_process() have PF_KTHREAD check, but oom_kill_process doesn't. It mean oom_kill_process() may choose wrong task, especially, when the child are using use_mm(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- mm/oom_kill.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 0aeacb2..dc8589e 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -467,6 +467,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, if (child->mm == p->mm) continue; + if (child->flags & PF_KTHREAD) + continue; if (mem && !task_in_mem_cgroup(child, mem)) continue; if (!has_intersects_mems_allowed(child, nodemask)) -- 1.6.5.2 --
Is it possible child is kthread even though parent isn't kthread? -- Kind regards, Minchan Kim --
Usually unhappen. but crappy driver can do any strange thing freely. As I said, oom code should have conservative assumption as far as possible. --
Okay. You change the check with oom_unkillable_task at last. The oom_unkillable_task is generic function so that the kthread check in oom_kill_process is tivial, I think. -- Kind regards, Minchan Kim --
Now, we have the same task check in two places. Unify it.
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/oom_kill.c | 33 ++++++++++++++++++++++-----------
1 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dc8589e..a4a5439 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -101,6 +101,26 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
return NULL;
}
+/* return true if the task is not adequate as candidate victim task. */
+static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
+ const nodemask_t *nodemask)
+{
+ if (is_global_init(p))
+ return true;
+ if (p->flags & PF_KTHREAD)
+ return true;
+
+ /* When mem_cgroup_out_of_memory() and p is not member of the group */
+ if (mem && !task_in_mem_cgroup(p, mem))
+ return true;
+
+ /* p may not have freeable memory in nodemask */
+ if (!has_intersects_mems_allowed(p, nodemask))
+ return true;
+
+ return false;
+}
+
/**
* badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
@@ -295,12 +315,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
for_each_process(p) {
unsigned long points;
- /* skip the init task and kthreads */
- if (is_global_init(p) || (p->flags & PF_KTHREAD))
- continue;
- if (mem && !task_in_mem_cgroup(p, mem))
- continue;
- if (!has_intersects_mems_allowed(p, nodemask))
+ if (oom_unkillable_task(p, mem, nodemask))
continue;
/*
@@ -467,11 +482,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
if (child->mm == p->mm)
continue;
- if (child->flags & PF_KTHREAD)
- continue;
- if (mem && !task_in_mem_cgroup(child, mem))
- continue;
- if (!has_intersects_mems_allowed(child, nodemask))
+ if (oom_unkillable_task(p, mem, nodemask))
...I returend this patch as review 7/11. Why didn't you check p->signal->oom_adj == OOM_DISABLE in here? I don't figure out code after your patches are applied totally. But I think it would be check it in this function as function's name says. -- Kind regards, Minchan Kim --
For preserve select_bad_process() semantics. It have
for_each_process(p) {
if (oom_unkillable_task(p, mem, nodemask))
continue;
if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
if (p != current)
return ERR_PTR(-1UL);
chosen = p;
*ppoints = ULONG_MAX;
}
if (oom_adj == OOM_DISABLE)
continue;
That said, Current OOM-Killer intend to kill PF_EXITING process even if
it have OOM_DISABLE. (practically, it's not kill. it only affect to give
allocation bonus to PF_EXITING process)
My trivial fixes series don't intend to make large semantics change.
--
When oom_kill_allocating_task is enabled, an argument task of
oom_kill_process is not selected by select_bad_process(), It's
just out_of_memory() caller task. It mean the task can be
unkillable. check it first.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/oom_kill.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a4a5439..ee00817 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -687,7 +687,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
check_panic_on_oom(constraint, gfp_mask, order);
read_lock(&tasklist_lock);
- if (sysctl_oom_kill_allocating_task) {
+ if (sysctl_oom_kill_allocating_task &&
+ !oom_unkillable_task(current, NULL, nodemask)) {
/*
* oom_kill_process() needs tasklist_lock held. If it returns
* non-zero, current could not be killed so we must fallback to
--
1.6.5.2
--
Reviewed-by: Minchan Kim <minchan.kim@gmail.com> -- Kind regards, Minchan Kim --
If kernel thread are using use_mm(), badness() return positive value.
This is not big issue because caller care it correctly. but there is
one exception, /proc/<pid>/oom_score call badness() directly and
don't care the task is regular process.
another example, /proc/1/oom_score return !0 value. but it's unkillable.
This incorrectness makes confusing to admin a bit.
This patch fixes it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
fs/proc/base.c | 5 +++--
mm/oom_kill.c | 13 +++++++------
2 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 28099a1..56b8d3e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -428,7 +428,8 @@ static const struct file_operations proc_lstats_operations = {
#endif
/* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
+unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
+ nodemask_t *nodemask, unsigned long uptime);
static int proc_oom_score(struct task_struct *task, char *buffer)
{
unsigned long points = 0;
@@ -437,7 +438,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)
do_posix_clock_monotonic_gettime(&uptime);
read_lock(&tasklist_lock);
if (pid_alive(task))
- points = badness(task, uptime.tv_sec);
+ points = badness(task, NULL, NULL, uptime.tv_sec);
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ee00817..fcbd21b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -139,8 +139,8 @@ static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
* algorithm has been meticulously tuned to meet the principle
* of least surprise ... (be careful when you change it)
*/
-
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
+ const nodemask_t ...Hmm. If it is a really problem, Could we solve it in proc_oom_score itself? -- Kind regards, Minchan Kim --
probably, no good idea. For maintainance view, all oom related code should be gathered in oom_kill.c. If you dislike to add messy into badness(), I hope to make badness_for_oom_score() or something like instead. --
I am looking forward to seeing your next series. Thanks, Kosaki. P.S) I think if the number of patch series is the bigger than #10, It would be better to include or point url of all-at-once patch in patch series. In case of your patch, post patches changes pre patches It could make hard review unless the reviewer merge patches into tree to see the final figure. -- Kind regards, Minchan Kim --
select_bad_process() and badness() have the same OOM_DISABLE check.
This patch kill one.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/oom_kill.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index fcbd21b..f5d903f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -350,9 +350,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
*ppoints = ULONG_MAX;
}
- if (p->signal->oom_adj == OOM_DISABLE)
- continue;
-
points = badness(p, mem, nodemask, uptime.tv_sec);
if (points > *ppoints || !chosen) {
chosen = p;
--
1.6.5.2
--
Reviewed-by: Minchan Kim <minchan.kim@gmail.com> -- Kind regards, Minchan Kim --
Now, if oom_kill_allocating_task is enabled and current have
OOM_DISABLED, following printk in oom_kill_process is called twice.
pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);
So, OOM_DISABLE check should be more early.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/oom_kill.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f5d903f..75f6dbc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -424,7 +424,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
static int oom_kill_task(struct task_struct *p)
{
p = find_lock_task_mm(p);
- if (!p || p->signal->oom_adj == OOM_DISABLE) {
+ if (!p) {
task_unlock(p);
return 1;
}
@@ -686,7 +686,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
read_lock(&tasklist_lock);
if (sysctl_oom_kill_allocating_task &&
- !oom_unkillable_task(current, NULL, nodemask)) {
+ !oom_unkillable_task(current, NULL, nodemask) &&
+ (current->signal->oom_adj != OOM_DISABLE)) {
/*
* oom_kill_process() needs tasklist_lock held. If it returns
* non-zero, current could not be killed so we must fallback to
--
1.6.5.2
--
If we check it in oom_unkillable_task, we don't need this patch. -- Kind regards, Minchan Kim --
Yup. but please read the commnet of [3/11]. --
Now has_intersects_mems_allowed() has own thread iterate logic, but it should use while_each_thread(). It slightly improve the code readability. Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- mm/oom_kill.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 75f6dbc..136708b 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -69,8 +69,8 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk, if (cpuset_mems_allowed_intersects(current, tsk)) return true; } - tsk = next_thread(tsk); - } while (tsk != start); + } while_each_thread(start, tsk); + return false; } #else -- 1.6.5.2 --
Current "child->mm == p->mm" mean prevent to select vfork() task.
But we don't have any reason to don't consider vfork().
Remvoed.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/oom_kill.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 136708b..b5678bf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,9 +479,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
list_for_each_entry(child, &t->children, sibling) {
unsigned long child_points;
- if (child->mm == p->mm)
- continue;
-
/* badness() returns 0 if the thread is unkillable */
child_points = badness(child, mem, nodemask,
uptime.tv_sec);
--
1.6.5.2
--
I guess "child->mm == p->mm" is for losing the minimal amount of work done as comment say. But frankly speaking, I don't understand it, either. Maybe "One shot, two kill" problem? Andrea. Could you explain it? -- Kind regards, Minchan Kim --
From: Luis Claudio R. Goncalves <lclaudio@uudg.org>
In a system under heavy load it was observed that even after the
oom-killer selects a task to die, the task may take a long time to die.
Right after sending a SIGKILL to the task selected by the oom-killer
this task has it's priority increased so that it can exit() exit soon,
freeing memory. That is accomplished by:
/*
* We give our sacrificial lamb high priority and access to
* all the memory it needs. That way it should be able to
* exit() and clear out its resources quickly...
*/
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
It sounds plausible giving the dying task an even higher priority to be
sure it will be scheduled sooner and free the desired memory. It was
suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that
this task won't interfere with any running RT task.
If the dying task is already an RT task, leave it untouched.
Another good suggestion, implemented here, was to avoid boosting the
dying task priority in case of mem_cgroup OOM.
Signed-off-by: Luis Claudio R. Goncalves <lclaudio@uudg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/oom_kill.c | 34 +++++++++++++++++++++++++++++++---
1 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b5678bf..0858b18 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -82,6 +82,24 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
#endif /* CONFIG_NUMA */
/*
+ * If this is a system OOM (not a memcg OOM) and the task selected to be
+ * killed is not already running at high (RT) priorities, speed up the
+ * recovery by boosting the dying task to the lowest FIFO priority.
+ * That helps with the recovery and avoids interfering with RT tasks.
+ */
+static void boost_dying_task_prio(struct task_struct *p,
+ struct mem_cgroup *mem)
+{
+ struct ...Sorry, I forgot to cc Luis. resend. --
Reviewed-by: Minchan Kim <minchan.kim@gmail.com> It seems code itself doesn't have a problem. So I give reviewed-by. But this patch might break fairness of normal process at corner case. If system working is more important than fairness of processes, It does make sense. But scheduler guys might have a different opinion. So at least, we need ACKs of scheduler guys. -- Kind regards, Minchan Kim --
On Wed, 30 Jun 2010 18:33:23 +0900 (JST)
We can actually make `param' static here. That saves a teeny bit of
code and a little bit of stack. The oom-killer can be called when
we're using a lot of stack.
But if we make that change we really should make the param arg to
sched_setscheduler_nocheck() be const. I did that (and was able to
convert lots of callers to use a static `param') but to complete the
job we'd need to chase through all the security goop, fixing up
security_task_setscheduler() and callees, and I got bored.
include/linux/sched.h | 2 +-
kernel/kthread.c | 2 +-
kernel/sched.c | 4 ++--
kernel/softirq.c | 4 +++-
kernel/stop_machine.c | 2 +-
kernel/workqueue.c | 2 +-
6 files changed, 9 insertions(+), 7 deletions(-)
diff -puN kernel/kthread.c~a kernel/kthread.c
--- a/kernel/kthread.c~a
+++ a/kernel/kthread.c
@@ -131,7 +131,7 @@ struct task_struct *kthread_create(int (
wait_for_completion(&create.done);
if (!IS_ERR(create.result)) {
- struct sched_param param = { .sched_priority = 0 };
+ static struct sched_param param = { .sched_priority = 0 };
va_list args;
va_start(args, namefmt);
diff -puN kernel/workqueue.c~a kernel/workqueue.c
--- a/kernel/workqueue.c~a
+++ a/kernel/workqueue.c
@@ -962,7 +962,7 @@ init_cpu_workqueue(struct workqueue_stru
static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
{
- struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+ static struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
struct workqueue_struct *wq = cwq->wq;
const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
struct task_struct *p;
diff -puN kernel/stop_machine.c~a kernel/stop_machine.c
--- a/kernel/stop_machine.c~a
+++ a/kernel/stop_machine.c
@@ -291,7 +291,7 @@ repeat:
static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
unsigned long action, void *hcpu)
{
- struct sched_param param = { ...ok, I've finished this works. I made two patches, for-security-tree and
for-core. diffstat is below.
I'll post the patches as reply of this mail.
KOSAKI Motohiro (2):
security: add const to security_task_setscheduler()
sched: make sched_param arugment static variables in some
sched_setscheduler() caller
include/linux/sched.h | 5 +++--
include/linux/security.h | 9 +++++----
kernel/irq/manage.c | 4 +++-
kernel/kthread.c | 2 +-
kernel/sched.c | 6 +++---
kernel/softirq.c | 4 +++-
kernel/stop_machine.c | 2 +-
kernel/trace/trace_selftest.c | 2 +-
kernel/watchdog.c | 2 +-
kernel/workqueue.c | 2 +-
security/commoncap.c | 2 +-
security/security.c | 4 ++--
security/selinux/hooks.c | 3 ++-
security/smack/smack_lsm.c | 2 +-
14 files changed, 28 insertions(+), 21 deletions(-)
--
All security modules shouldn't change sched_param parameter of
security_task_setscheduler(). This is not only meaningless, but
also make harmful result if caller pass static variable.
This patch add const to it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
include/linux/security.h | 9 +++++----
security/commoncap.c | 2 +-
security/security.c | 4 ++--
security/selinux/hooks.c | 3 ++-
security/smack/smack_lsm.c | 2 +-
5 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/include/linux/security.h b/include/linux/security.h
index 5bcb395..07e94e5 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -74,7 +74,8 @@ extern int cap_file_mmap(struct file *file, unsigned long reqprot,
extern int cap_task_fix_setuid(struct cred *new, const struct cred *old, int flags);
extern int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
unsigned long arg4, unsigned long arg5);
-extern int cap_task_setscheduler(struct task_struct *p, int policy, struct sched_param *lp);
+extern int cap_task_setscheduler(struct task_struct *p, int policy,
+ const struct sched_param *lp);
extern int cap_task_setioprio(struct task_struct *p, int ioprio);
extern int cap_task_setnice(struct task_struct *p, int nice);
extern int cap_syslog(int type, bool from_file);
@@ -1501,7 +1502,7 @@ struct security_operations {
int (*task_getioprio) (struct task_struct *p);
int (*task_setrlimit) (unsigned int resource, struct rlimit *new_rlim);
int (*task_setscheduler) (struct task_struct *p, int policy,
- struct sched_param *lp);
+ const struct sched_param *lp);
int (*task_getscheduler) (struct task_struct *p);
int (*task_movememory) (struct task_struct *p);
int (*task_kill) (struct task_struct *p,
@@ -1750,8 +1751,8 @@ int security_task_setnice(struct task_struct *p, int nice);
int security_task_setioprio(struct task_struct *p, int ioprio);
int security_task_getioprio(struct ...Andrew Morton pointed out almost sched_setscheduler() caller are
using fixed parameter and it can be converted static. it reduce
runtume memory waste a bit.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
include/linux/sched.h | 5 +++--
kernel/irq/manage.c | 4 +++-
kernel/kthread.c | 2 +-
kernel/sched.c | 6 +++---
kernel/softirq.c | 4 +++-
kernel/stop_machine.c | 2 +-
kernel/trace/trace_selftest.c | 2 +-
kernel/watchdog.c | 2 +-
kernel/workqueue.c | 2 +-
9 files changed, 17 insertions(+), 12 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cfcd13e..97b7fe3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1917,9 +1917,10 @@ extern int task_nice(const struct task_struct *p);
extern int can_nice(const struct task_struct *p, const int nice);
extern int task_curr(const struct task_struct *p);
extern int idle_cpu(int cpu);
-extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
+extern int sched_setscheduler(struct task_struct *, int,
+ const struct sched_param *);
extern int sched_setscheduler_nocheck(struct task_struct *, int,
- struct sched_param *);
+ const struct sched_param *);
extern struct task_struct *idle_task(int cpu);
extern struct task_struct *curr_task(int cpu);
extern void set_curr_task(int cpu, struct task_struct *p);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 3164ba7..ce1dbb8 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -569,7 +569,9 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action) { }
*/
static int irq_thread(void *data)
{
- struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2, };
+ static struct sched_param param = {
+ .sched_priority = MAX_USER_RT_PRIO/2,
+ };
...This is a thread that runs on boot up to test the sched_wakeup tracer. Then it is deleted and all memory is reclaimed. Thus, this patch just took memory that was usable at run time and removed it permanently. Please Cc me on all tracing changes. -- Steve --
Confused. kernel/trace/ appears to waste resources by design, so what's the issue? I don't think this change will cause more waste. It'll consume 4 bytes Well if we're so worried about resource wastage then how about making all boot-time-only text and data reside in __init and __initdata sections rather than hanging around uselessly in memory for ever? Only that's going to be hard because we went and added pointers into .init.text from .data due to `struct tracer.selftest', which will cause a storm of section mismatch warnings. Doh, should have invoked the selftests from initcalls. That might open the opportunity of running the selftests by modprobing the selftest module, too. And I _do_ wish the selftest module was modprobeable, rather than this monstrosity: #ifdef CONFIG_FTRACE_SELFTEST /* Let selftest have access to static functions in this file */ #include "trace_selftest.c" #endif Really? Who had a tastebudectomy over there? At least call it trace_selftest.inc or something, so poor schmucks don't go scrabbling around wondering "how the hell does this thing get built oh no they didn't really go and #include it did they?" --
That would be a patch I would like :-) They are called by initcalls. The initcalls register the tracers and that is the time we call the selftest. No other time. Is there a way that we set up a function pointer to let the section The selftests are done by individual tracers at boot up. It would be Well this is also the way sched.c adds all its extra code. Making it trace_selftest.inc would make it hard to know what the hell it was. And also hard for editors to know what type of file it is, or things can be missed with a 'find . -name "*.[ch]" | xargs grep blahblah' Yes, the self tests are ugly and can probably go with an overhaul. Since we are trying to get away from the tracer plugins anyway, they will start disappearing when the plugins do. We should have some main selftests anyway. Those are for the TRACE_EVENT tests (which are not even in the trace_selftest.c file, and the function testing which currently are, as well as the latency testers. The trace_selftest.c should eventually be replaced with more compact tests for the specific types of tracing. -- Steve --
<sticks his nose in modpost.c for the first time> There are various whitelisting hacks in there based on the name of the offending symbol. Search term: DEFAULT_SYMBOL_WHITE_LIST. It'd be cleaner to just zap the tracer.selftest field altogether and No, if tracer_selftest.o was linked into vmlinux then the tests get run within do_initcalls(). If tracer_selftest.o is a module, then the tests get run at modprobe-time. The latter option may not be terribly useful Well bad luck. Of _course_ it makes a mess. It's already a mess. How's about just removing the `static' from whichever symbols tracer_selftest needs? That'd surely be better than #including a .c --
Agreed, moving things to kernel/sched/ and adding some internal.h thing could cure that, but I simply haven't gotten around to cleaning that up.. --
Oleg pointed out current PF_EXITING check is wrong. Because PF_EXITING
is per-thread flag, not per-process flag. He said,
Two threads, group-leader L and its sub-thread T. T dumps the code.
In this case both threads have ->mm != NULL, L has PF_EXITING.
The first problem is, select_bad_process() always return -1 in this
case (even if the caller is T, this doesn't matter).
The second problem is that we should add TIF_MEMDIE to T, not L.
I think we can remove this dubious PF_EXITING check. but as first step,
This patch add the protection of multi threaded issue.
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/oom_kill.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0858b18..b04e557 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -360,7 +360,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
* the process of exiting and releasing its resources.
* Otherwise we could get an easy OOM deadlock.
*/
- if ((p->flags & PF_EXITING) && p->mm) {
+ if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
if (p != current)
return ERR_PTR(-1UL);
--
1.6.5.2
--
