Re: [PATCH 2/2] sched: make sched_param arugment static variables in some sched_setscheduler() caller

Previous thread: by Don Gauthier on Wednesday, June 30, 2010 - 2:17 am. (1 message)

Next thread: [PATCH] KEYS: Reinstate lost passing of process keyring ID in call_sbin_request_key() by David Howells on Wednesday, June 30, 2010 - 2:39 am. (2 messages)
From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 2:25 am

Hi

Here is updated series for various OOM fixes.
Almost fixes are trivial. 

One big improvement is Luis's dying task priority boost patch.
This is necessary for RT folks.


  oom: don't try to kill oom_unkillable child
  oom: oom_kill_process() doesn't select kthread child
  oom: make oom_unkillable_task() helper function
  oom: oom_kill_process() need to check p is unkillable
  oom: /proc/<pid>/oom_score treat kernel thread honestly
  oom: kill duplicate OOM_DISABLE check
  oom: move OOM_DISABLE check from oom_kill_task to out_of_memory()
  oom: cleanup has_intersects_mems_allowed()
  oom: remove child->mm check from oom_kill_process()
  oom: give the dying task a higher priority
  oom: multi threaded process coredump don't make deadlock

 fs/proc/base.c |    5 ++-
 mm/oom_kill.c  |  100 +++++++++++++++++++++++++++++++++++++++-----------------
 2 files changed, 73 insertions(+), 32 deletions(-)

--

From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 2:27 am

Now, badness() doesn't care neigher CPUSET nor mempolicy. Then
if the victim child process have disjoint nodemask, OOM Killer might
kill innocent process.

This patch fixes it.

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26ae697..0aeacb2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -429,7 +429,7 @@ static int oom_kill_task(struct task_struct *p)
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    unsigned long points, struct mem_cgroup *mem,
-			    const char *message)
+			    nodemask_t *nodemask, const char *message)
 {
 	struct task_struct *victim = p;
 	struct task_struct *child;
@@ -469,6 +469,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 				continue;
 			if (mem && !task_in_mem_cgroup(child, mem))
 				continue;
+			if (!has_intersects_mems_allowed(child, nodemask))
+				continue;
 
 			/* badness() returns 0 if the thread is unkillable */
 			child_points = badness(child, uptime.tv_sec);
@@ -519,7 +521,7 @@ retry:
 	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
-	if (oom_kill_process(p, gfp_mask, 0, points, mem,
+	if (oom_kill_process(p, gfp_mask, 0, points, mem, NULL,
 				"Memory cgroup out of memory"))
 		goto retry;
 out:
@@ -678,7 +680,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		 * non-zero, current could not be killed so we must fallback to
 		 * the tasklist scan.
 		 */
-		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
+		if (!oom_kill_process(current, gfp_mask, order, 0, NULL, nodemask,
 				"Out of memory (oom_kill_allocating_task)"))
 			return;
 	}
@@ -697,7 +699,7 @@ retry:
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oom_kill_process(p, gfp_mask, order, points, NULL,
+	if ...
From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 2:27 am

Now, select_bad_process() have PF_KTHREAD check, but oom_kill_process
doesn't. It mean oom_kill_process() may choose wrong task, especially,
when the child are using use_mm().

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0aeacb2..dc8589e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -467,6 +467,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 			if (child->mm == p->mm)
 				continue;
+			if (child->flags & PF_KTHREAD)
+				continue;
 			if (mem && !task_in_mem_cgroup(child, mem))
 				continue;
 			if (!has_intersects_mems_allowed(child, nodemask))
-- 
1.6.5.2



--

From: Minchan Kim
Date: Wednesday, June 30, 2010 - 6:55 am

Is it possible child is kthread even though parent isn't kthread?

-- 
Kind regards,
Minchan Kim
--

From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 5:07 pm

Usually unhappen. but crappy driver can do any strange thing freely.
As I said, oom code should have conservative assumption as far as possible.



--

From: Minchan Kim
Date: Thursday, July 1, 2010 - 6:38 am

Okay. You change the check with oom_unkillable_task at last. 
The oom_unkillable_task is generic function so that the kthread check in
oom_kill_process is tivial, I think. 


-- 
Kind regards,
Minchan Kim
--

From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 2:28 am

Now, we have the same task check in two places. Unify it.

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |   33 ++++++++++++++++++++++-----------
 1 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dc8589e..a4a5439 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -101,6 +101,26 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
 	return NULL;
 }
 
+/* return true if the task is not adequate as candidate victim task. */
+static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
+			   const nodemask_t *nodemask)
+{
+	if (is_global_init(p))
+		return true;
+	if (p->flags & PF_KTHREAD)
+		return true;
+
+	/* When mem_cgroup_out_of_memory() and p is not member of the group */
+	if (mem && !task_in_mem_cgroup(p, mem))
+		return true;
+
+	/* p may not have freeable memory in nodemask */
+	if (!has_intersects_mems_allowed(p, nodemask))
+		return true;
+
+	return false;
+}
+
 /**
  * badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
@@ -295,12 +315,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 	for_each_process(p) {
 		unsigned long points;
 
-		/* skip the init task and kthreads */
-		if (is_global_init(p) || (p->flags & PF_KTHREAD))
-			continue;
-		if (mem && !task_in_mem_cgroup(p, mem))
-			continue;
-		if (!has_intersects_mems_allowed(p, nodemask))
+		if (oom_unkillable_task(p, mem, nodemask))
 			continue;
 
 		/*
@@ -467,11 +482,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 			if (child->mm == p->mm)
 				continue;
-			if (child->flags & PF_KTHREAD)
-				continue;
-			if (mem && !task_in_mem_cgroup(child, mem))
-				continue;
-			if (!has_intersects_mems_allowed(child, nodemask))
+			if (oom_unkillable_task(p, mem, nodemask))
 ...
From: Minchan Kim
Date: Wednesday, June 30, 2010 - 7:19 am

I returend this patch as review 7/11. 
Why didn't you check p->signal->oom_adj == OOM_DISABLE in here?
I don't figure out code after your patches are applied totally.
But I think it would be check it in this function as function's name says.


-- 
Kind regards,
Minchan Kim
--

From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 5:07 pm

For preserve select_bad_process() semantics. It have

        for_each_process(p) {
                if (oom_unkillable_task(p, mem, nodemask))
			continue;

                if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
                        if (p != current)
                                return ERR_PTR(-1UL);

                        chosen = p;
                        *ppoints = ULONG_MAX;
                }

	        if (oom_adj == OOM_DISABLE)
			continue;

That said, Current OOM-Killer intend to kill PF_EXITING process even if
it have OOM_DISABLE. (practically, it's not kill. it only affect to give 
allocation bonus to PF_EXITING process)

My trivial fixes series don't intend to make large semantics change.



--

From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 2:29 am

When oom_kill_allocating_task is enabled, an argument task of
oom_kill_process is not selected by select_bad_process(), It's
just out_of_memory() caller task. It mean the task can be
unkillable. check it first.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a4a5439..ee00817 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -687,7 +687,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	check_panic_on_oom(constraint, gfp_mask, order);
 
 	read_lock(&tasklist_lock);
-	if (sysctl_oom_kill_allocating_task) {
+	if (sysctl_oom_kill_allocating_task &&
+	    !oom_unkillable_task(current, NULL, nodemask)) {
 		/*
 		 * oom_kill_process() needs tasklist_lock held.  If it returns
 		 * non-zero, current could not be killed so we must fallback to
-- 
1.6.5.2



--

From: Minchan Kim
Date: Wednesday, June 30, 2010 - 6:57 am

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
-- 
Kind regards,
Minchan Kim
--

From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 2:30 am

If kernel thread are using use_mm(), badness() return positive value.
This is not big issue because caller care it correctly. but there is
one exception, /proc/<pid>/oom_score call badness() directly and
don't care the task is regular process.

another example, /proc/1/oom_score return !0 value. but it's unkillable.
This incorrectness makes confusing to admin a bit.

This patch fixes it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/base.c |    5 +++--
 mm/oom_kill.c  |   13 +++++++------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 28099a1..56b8d3e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -428,7 +428,8 @@ static const struct file_operations proc_lstats_operations = {
 #endif
 
 /* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
+unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
+		      nodemask_t *nodemask, unsigned long uptime);
 static int proc_oom_score(struct task_struct *task, char *buffer)
 {
 	unsigned long points = 0;
@@ -437,7 +438,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)
 	do_posix_clock_monotonic_gettime(&uptime);
 	read_lock(&tasklist_lock);
 	if (pid_alive(task))
-		points = badness(task, uptime.tv_sec);
+		points = badness(task, NULL, NULL, uptime.tv_sec);
 	read_unlock(&tasklist_lock);
 	return sprintf(buffer, "%lu\n", points);
 }
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ee00817..fcbd21b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -139,8 +139,8 @@ static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
  *    algorithm has been meticulously tuned to meet the principle
  *    of least surprise ... (be careful when you change it)
  */
-
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
+		      const nodemask_t ...
From: Minchan Kim
Date: Wednesday, June 30, 2010 - 7:03 am

Hmm. If it is a really problem, Could we solve it in proc_oom_score itself?

-- 
Kind regards,
Minchan Kim
--

From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 5:07 pm

probably, no good idea. For maintainance view, all oom related code should
be gathered in oom_kill.c.
If you dislike to add messy into badness(), I hope to make badness_for_oom_score()
or something like instead.



--

From: Minchan Kim
Date: Thursday, July 1, 2010 - 7:36 am

I am looking forward to seeing your next series.
Thanks, Kosaki. 

P.S) 
I think if the number of patch series is the bigger than #10, 
It would be better to include or point url of all-at-once patch 
in patch series.

In case of your patch, post patches changes pre patches
It could make hard review unless the reviewer merge patches into tree to
see the final figure. 

-- 
Kind regards,
Minchan Kim
--

From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 2:31 am

select_bad_process() and badness() have the same OOM_DISABLE check.
This patch kill one.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index fcbd21b..f5d903f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -350,9 +350,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			*ppoints = ULONG_MAX;
 		}
 
-		if (p->signal->oom_adj == OOM_DISABLE)
-			continue;
-
 		points = badness(p, mem, nodemask, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
 			chosen = p;
-- 
1.6.5.2



--

From: Minchan Kim
Date: Wednesday, June 30, 2010 - 7:10 am

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim
--

From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 2:31 am

Now, if oom_kill_allocating_task is enabled and current have
OOM_DISABLED, following printk in oom_kill_process is called twice.

    pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
            message, task_pid_nr(p), p->comm, points);

So, OOM_DISABLE check should be more early.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f5d903f..75f6dbc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -424,7 +424,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 static int oom_kill_task(struct task_struct *p)
 {
 	p = find_lock_task_mm(p);
-	if (!p || p->signal->oom_adj == OOM_DISABLE) {
+	if (!p) {
 		task_unlock(p);
 		return 1;
 	}
@@ -686,7 +686,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 
 	read_lock(&tasklist_lock);
 	if (sysctl_oom_kill_allocating_task &&
-	    !oom_unkillable_task(current, NULL, nodemask)) {
+	    !oom_unkillable_task(current, NULL, nodemask) &&
+	    (current->signal->oom_adj != OOM_DISABLE)) {
 		/*
 		 * oom_kill_process() needs tasklist_lock held.  If it returns
 		 * non-zero, current could not be killed so we must fallback to
-- 
1.6.5.2



--

From: Minchan Kim
Date: Wednesday, June 30, 2010 - 7:20 am

If we check it in oom_unkillable_task, we don't need this patch. 

-- 
Kind regards,
Minchan Kim
--

From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 5:07 pm

Yup. but please read the commnet of [3/11].



--

From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 2:32 am

Now has_intersects_mems_allowed() has own thread iterate logic, but
it should use while_each_thread().

It slightly improve the code readability.

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 75f6dbc..136708b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -69,8 +69,8 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
 			if (cpuset_mems_allowed_intersects(current, tsk))
 				return true;
 		}
-		tsk = next_thread(tsk);
-	} while (tsk != start);
+	} while_each_thread(start, tsk);
+
 	return false;
 }
 #else
-- 
1.6.5.2



--

From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 2:32 am

Current "child->mm == p->mm" mean prevent to select vfork() task.
But we don't have any reason to don't consider vfork().

Remvoed.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 136708b..b5678bf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,9 +479,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned long child_points;
 
-			if (child->mm == p->mm)
-				continue;
-
 			/* badness() returns 0 if the thread is unkillable */
 			child_points = badness(child, mem, nodemask,
 					       uptime.tv_sec);
-- 
1.6.5.2



--

From: Minchan Kim
Date: Wednesday, June 30, 2010 - 7:30 am

I guess "child->mm == p->mm" is for losing the minimal amount of work done
as comment say. But frankly speaking, I don't understand it, either.
Maybe "One shot, two kill" problem?

Andrea. Could you explain it?

-- 
Kind regards,
Minchan Kim
--

From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 2:33 am

From: Luis Claudio R. Goncalves <lclaudio@uudg.org>

In a system under heavy load it was observed that even after the
oom-killer selects a task to die, the task may take a long time to die.

Right after sending a SIGKILL to the task selected by the oom-killer
this task has it's priority increased so that it can exit() exit soon,
freeing memory. That is accomplished by:

        /*
         * We give our sacrificial lamb high priority and access to
         * all the memory it needs. That way it should be able to
         * exit() and clear out its resources quickly...
         */
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);

It sounds plausible giving the dying task an even higher priority to be
sure it will be scheduled sooner and free the desired memory. It was
suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that
this task won't interfere with any running RT task.

If the dying task is already an RT task, leave it untouched.
Another good suggestion, implemented here, was to avoid boosting the
dying task priority in case of mem_cgroup OOM.

Signed-off-by: Luis Claudio R. Goncalves <lclaudio@uudg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |   34 +++++++++++++++++++++++++++++++---
 1 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b5678bf..0858b18 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -82,6 +82,24 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
 #endif /* CONFIG_NUMA */
 
 /*
+ * If this is a system OOM (not a memcg OOM) and the task selected to be
+ * killed is not already running at high (RT) priorities, speed up the
+ * recovery by boosting the dying task to the lowest FIFO priority.
+ * That helps with the recovery and avoids interfering with RT tasks.
+ */
+static void boost_dying_task_prio(struct task_struct *p,
+				  struct mem_cgroup *mem)
+{
+	struct ...
From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 2:35 am

Sorry, I forgot to cc Luis. resend.





--

From: Minchan Kim
Date: Wednesday, June 30, 2010 - 7:40 am

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

It seems code itself doesn't have a problem.
So I give reviewed-by.
But this patch might break fairness of normal process at corner case.
If system working is more important than fairness of processes,
It does make sense. But scheduler guys might have a different opinion.

So at least, we need ACKs of scheduler guys.

-- 
Kind regards,
Minchan Kim
--

From: Andrew Morton
Date: Friday, July 2, 2010 - 2:49 pm

On Wed, 30 Jun 2010 18:33:23 +0900 (JST)

We can actually make `param' static here.  That saves a teeny bit of
code and a little bit of stack.  The oom-killer can be called when
we're using a lot of stack.

But if we make that change we really should make the param arg to
sched_setscheduler_nocheck() be const.  I did that (and was able to
convert lots of callers to use a static `param') but to complete the
job we'd need to chase through all the security goop, fixing up
security_task_setscheduler() and callees, and I got bored.


 include/linux/sched.h |    2 +-
 kernel/kthread.c      |    2 +-
 kernel/sched.c        |    4 ++--
 kernel/softirq.c      |    4 +++-
 kernel/stop_machine.c |    2 +-
 kernel/workqueue.c    |    2 +-
 6 files changed, 9 insertions(+), 7 deletions(-)

diff -puN kernel/kthread.c~a kernel/kthread.c
--- a/kernel/kthread.c~a
+++ a/kernel/kthread.c
@@ -131,7 +131,7 @@ struct task_struct *kthread_create(int (
 	wait_for_completion(&create.done);
 
 	if (!IS_ERR(create.result)) {
-		struct sched_param param = { .sched_priority = 0 };
+		static struct sched_param param = { .sched_priority = 0 };
 		va_list args;
 
 		va_start(args, namefmt);
diff -puN kernel/workqueue.c~a kernel/workqueue.c
--- a/kernel/workqueue.c~a
+++ a/kernel/workqueue.c
@@ -962,7 +962,7 @@ init_cpu_workqueue(struct workqueue_stru
 
 static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 {
-	struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+	static struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
 	struct workqueue_struct *wq = cwq->wq;
 	const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
 	struct task_struct *p;
diff -puN kernel/stop_machine.c~a kernel/stop_machine.c
--- a/kernel/stop_machine.c~a
+++ a/kernel/stop_machine.c
@@ -291,7 +291,7 @@ repeat:
 static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
 					   unsigned long action, void *hcpu)
 {
-	struct sched_param param = { ...
From: KOSAKI Motohiro
Date: Monday, July 5, 2010 - 5:49 pm

ok, I've finished this works. I made two patches, for-security-tree and
for-core. diffstat is below.

I'll post the patches as reply of this mail.


KOSAKI Motohiro (2):
  security: add const to security_task_setscheduler()
  sched: make sched_param arugment static variables in some
    sched_setscheduler() caller

 include/linux/sched.h         |    5 +++--
 include/linux/security.h      |    9 +++++----
 kernel/irq/manage.c           |    4 +++-
 kernel/kthread.c              |    2 +-
 kernel/sched.c                |    6 +++---
 kernel/softirq.c              |    4 +++-
 kernel/stop_machine.c         |    2 +-
 kernel/trace/trace_selftest.c |    2 +-
 kernel/watchdog.c             |    2 +-
 kernel/workqueue.c            |    2 +-
 security/commoncap.c          |    2 +-
 security/security.c           |    4 ++--
 security/selinux/hooks.c      |    3 ++-
 security/smack/smack_lsm.c    |    2 +-
 14 files changed, 28 insertions(+), 21 deletions(-)




--

From: KOSAKI Motohiro
Date: Monday, July 5, 2010 - 5:50 pm

All security modules shouldn't change sched_param parameter of
security_task_setscheduler(). This is not only meaningless, but
also make harmful result if caller pass static variable.

This patch add const to it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/security.h   |    9 +++++----
 security/commoncap.c       |    2 +-
 security/security.c        |    4 ++--
 security/selinux/hooks.c   |    3 ++-
 security/smack/smack_lsm.c |    2 +-
 5 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/include/linux/security.h b/include/linux/security.h
index 5bcb395..07e94e5 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -74,7 +74,8 @@ extern int cap_file_mmap(struct file *file, unsigned long reqprot,
 extern int cap_task_fix_setuid(struct cred *new, const struct cred *old, int flags);
 extern int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 			  unsigned long arg4, unsigned long arg5);
-extern int cap_task_setscheduler(struct task_struct *p, int policy, struct sched_param *lp);
+extern int cap_task_setscheduler(struct task_struct *p, int policy,
+				 const struct sched_param *lp);
 extern int cap_task_setioprio(struct task_struct *p, int ioprio);
 extern int cap_task_setnice(struct task_struct *p, int nice);
 extern int cap_syslog(int type, bool from_file);
@@ -1501,7 +1502,7 @@ struct security_operations {
 	int (*task_getioprio) (struct task_struct *p);
 	int (*task_setrlimit) (unsigned int resource, struct rlimit *new_rlim);
 	int (*task_setscheduler) (struct task_struct *p, int policy,
-				  struct sched_param *lp);
+				  const struct sched_param *lp);
 	int (*task_getscheduler) (struct task_struct *p);
 	int (*task_movememory) (struct task_struct *p);
 	int (*task_kill) (struct task_struct *p,
@@ -1750,8 +1751,8 @@ int security_task_setnice(struct task_struct *p, int nice);
 int security_task_setioprio(struct task_struct *p, int ioprio);
 int security_task_getioprio(struct ...
From: KOSAKI Motohiro
Date: Monday, July 5, 2010 - 5:51 pm

Andrew Morton pointed out almost sched_setscheduler() caller are
using fixed parameter and it can be converted static. it reduce
runtume memory waste a bit.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/sched.h         |    5 +++--
 kernel/irq/manage.c           |    4 +++-
 kernel/kthread.c              |    2 +-
 kernel/sched.c                |    6 +++---
 kernel/softirq.c              |    4 +++-
 kernel/stop_machine.c         |    2 +-
 kernel/trace/trace_selftest.c |    2 +-
 kernel/watchdog.c             |    2 +-
 kernel/workqueue.c            |    2 +-
 9 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cfcd13e..97b7fe3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1917,9 +1917,10 @@ extern int task_nice(const struct task_struct *p);
 extern int can_nice(const struct task_struct *p, const int nice);
 extern int task_curr(const struct task_struct *p);
 extern int idle_cpu(int cpu);
-extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
+extern int sched_setscheduler(struct task_struct *, int,
+			      const struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
-				      struct sched_param *);
+				      const struct sched_param *);
 extern struct task_struct *idle_task(int cpu);
 extern struct task_struct *curr_task(int cpu);
 extern void set_curr_task(int cpu, struct task_struct *p);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 3164ba7..ce1dbb8 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -569,7 +569,9 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action) { }
  */
 static int irq_thread(void *data)
 {
-	struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2, };
+	static struct sched_param param = {
+		.sched_priority = MAX_USER_RT_PRIO/2,
+	};
 ...
From: Steven Rostedt
Date: Tuesday, July 6, 2010 - 3:13 pm

This is a thread that runs on boot up to test the sched_wakeup tracer.
Then it is deleted and all memory is reclaimed.

Thus, this patch just took memory that was usable at run time and
removed it permanently.

Please Cc me on all tracing changes.

-- Steve


--

From: Andrew Morton
Date: Tuesday, July 6, 2010 - 4:12 pm

Confused.  kernel/trace/ appears to waste resources by design, so what's
the issue?

I don't think this change will cause more waste.  It'll consume 4 bytes

Well if we're so worried about resource wastage then how about making
all boot-time-only text and data reside in __init and __initdata
sections rather than hanging around uselessly in memory for ever?

Only that's going to be hard because we went and added pointers into
.init.text from .data due to `struct tracer.selftest', which will cause
a storm of section mismatch warnings.  Doh, should have invoked the
selftests from initcalls.  That might open the opportunity of running
the selftests by modprobing the selftest module, too.

And I _do_ wish the selftest module was modprobeable, rather than this
monstrosity:

#ifdef CONFIG_FTRACE_SELFTEST
/* Let selftest have access to static functions in this file */
#include "trace_selftest.c"
#endif

Really?  Who had a tastebudectomy over there?  At least call it
trace_selftest.inc or something, so poor schmucks don't go scrabbling
around wondering "how the hell does this thing get built oh no they
didn't really go and #include it did they?"


--

From: Steven Rostedt
Date: Tuesday, July 6, 2010 - 4:49 pm

That would be a patch I would like :-)


They are called by initcalls. The initcalls register the tracers and
that is the time we call the selftest. No other time.

Is there a way that we set up a function pointer to let the section

The selftests are done by individual tracers at boot up. It would be


Well this is also the way sched.c adds all its extra code. Making it
trace_selftest.inc would make it hard to know what the hell it was. And
also hard for editors to know what type of file it is, or things can be
missed with a 'find . -name "*.[ch]" | xargs grep blahblah'

Yes, the self tests are ugly and can probably go with an overhaul. Since
we are trying to get away from the tracer plugins anyway, they will
start disappearing when the plugins do.

We should have some main selftests anyway. Those are for the TRACE_EVENT
tests (which are not even in the trace_selftest.c file, and the function
testing which currently are, as well as the latency testers.

The trace_selftest.c should eventually be replaced with more compact
tests for the specific types of tracing.

-- Steve


--

From: Andrew Morton
Date: Tuesday, July 6, 2010 - 5:02 pm

<sticks his nose in modpost.c for the first time>

There are various whitelisting hacks in there based on the name of the
offending symbol.  Search term: DEFAULT_SYMBOL_WHITE_LIST.

It'd be cleaner to just zap the tracer.selftest field altogether and

No, if tracer_selftest.o was linked into vmlinux then the tests get run
within do_initcalls().  If tracer_selftest.o is a module, then the tests
get run at modprobe-time.  The latter option may not be terribly useful



Well bad luck.  Of _course_ it makes a mess.  It's already a mess.

How's about just removing the `static' from whichever symbols
tracer_selftest needs?  That'd surely be better than #including a .c

--

From: Peter Zijlstra
Date: Wednesday, July 7, 2010 - 12:43 pm

Agreed, moving things to kernel/sched/ and adding some internal.h thing
could cure that, but I simply haven't gotten around to cleaning that
up..
--

From: KOSAKI Motohiro
Date: Wednesday, June 30, 2010 - 2:34 am

Oleg pointed out current PF_EXITING check is wrong. Because PF_EXITING
is per-thread flag, not per-process flag. He said,

   Two threads, group-leader L and its sub-thread T. T dumps the code.
   In this case both threads have ->mm != NULL, L has PF_EXITING.

   The first problem is, select_bad_process() always return -1 in this
   case (even if the caller is T, this doesn't matter).

   The second problem is that we should add TIF_MEMDIE to T, not L.

I think we can remove this dubious PF_EXITING check. but as first step,
This patch add the protection of multi threaded issue.

Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0858b18..b04e557 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -360,7 +360,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		 * the process of exiting and releasing its resources.
 		 * Otherwise we could get an easy OOM deadlock.
 		 */
-		if ((p->flags & PF_EXITING) && p->mm) {
+		if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
 			if (p != current)
 				return ERR_PTR(-1UL);
 
-- 
1.6.5.2



--

Previous thread: by Don Gauthier on Wednesday, June 30, 2010 - 2:17 am. (1 message)

Next thread: [PATCH] KEYS: Reinstate lost passing of process keyring ID in call_sbin_request_key() by David Howells on Wednesday, June 30, 2010 - 2:39 am. (2 messages)