login
Header Space

 
 

Re: [BUG]: Crash with CONFIG_FAIR_CGROUP_SCHED=y

Previous thread: [PATCH mm] unionfs: clear partial read by Hugh Dickins on Friday, November 9, 2007 - 3:31 am. (2 messages)

Next thread: 2.6.24-rc1 and 2.6.24.rc2 hangs while running udev on my laptop by SANGOI DINO LEONARDO on Friday, November 9, 2007 - 4:47 am. (4 messages)
To: <vatsa@...>, <balbir@...>
Cc: Containers <containers@...>, <ckrm-tech@...>, <linux-kernel@...>
Date: Thursday, November 8, 2007 - 7:48 pm

With CONFIG_FAIR_CGROUP_SCHED=y, following commands on 2.6.24-rc1 crash
the system.

	$ mount -t cgroup none /cgroups
	$ ./ns_exec -cm /bin/ls

"ns_exec -cm" calls clone() to clone the mount namespace and then
executes the '/bin/ls' program in the cloned child.

Some observations that Serge and I made (we have been able to reproduce
reliably, but crash logs have not been very useful)

	a. If we skip the 'mount' command, there is no crash.

	b. If CONFIG_FAIR_CGROUP_SCHED=n again, there is no crash (even with 
	   'mount' command).

       	c. mounting just the cpu or just the ns subsystem does not
           lead to a crash.  You can even

           mount -t cgroup -o cpu none /mnt1
           mount -t cgroup -o ns none /mnt2

           and you still don't get a crash

        d. mount -t cgroup -o cpu,ns none /cgroup

           will always cause a crash with subsequent ns_exec

ns_exec.c and config file are attached. Let us know if you need more info.

Suka

---
Crash log:

Red Hat Enterprise Linux release 4.90 (Tikanga)
Kernel 2.6.24-rc1 on an x86_64

elm3a241 login: Unable to handle kernel NULL pointer dereferenceUnable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
 at 0000000000000000 RIP:
 [&lt;0000000000000000&gt;]
 [&lt;0000000000000000&gt;]
PGD 102d4d067 PGD 102d4d067 PUD 102c88067 PUD 102c88067 PMD 0 PMD 0

Oops: 0000 [1] Oops: 0000 [1] SMP SMP

CPU 2 CPU 2

Modules linked in:Modules linked in:

Pid: 3273, comm: ns_exec Not tainted 2.6.24-rc1 #9
Pid: 3273, comm: ns_exec Not tainted 2.6.24-rc1 #9
RIP: 0010:[&lt;0000000000000000&gt;] RIP: 0010:[&lt;0000000000000000&gt;]  [&lt;0000000000000000&gt;]
 [&lt;0000000000000000&gt;]
RSP: 0018:ffff8101006a6af0  EFLAGS: 00055292
RSP: 0018:ffff8101006a6af0  EFLAGS: 00055292
RAX: 0000000000000000 RBX: ffff810100d11140 RCX: ffff810101de4000
RAX: 0000000000000000 RBX: ffff810100d11140 RCX: ffff810101de4000
RDX: 0000000000000000 RSI: ffff810100d1a880 RDI: ffff810001037c00
RDX: 0000000...
To: <sukadev@...>
Cc: <balbir@...>, Containers <containers@...>, <ckrm-tech@...>, <linux-kernel@...>, <dhaval@...>, Ingo Molnar <mingo@...>, <dmitry.adamushko@...>, <efault@...>
Date: Friday, November 9, 2007 - 3:02 am

Thanks for reporting the problem. It was caused because of the fact that
current task isn't kept in its runqueue in case of sched_fair class
tasks.

With the patch below, I could run ns_exec w/o any crash. Can you pls
verify it works for you as well?

Ingo,
	Once Suka verifies that the patch fixes his crash, I would request you 
to include the same in your tree and route it to Linus.

--

current task is not present in its runqueue in case of sched_fair class
tasks. Take care of this fact in rt_mutex_setprio(),
sched_setscheduler() and sched_move_task() routines.

Signed-off-by : Srivatsa Vaddagiri &lt;vatsa@linux.vnet.ibm.com&gt;


---
 kernel/sched.c |   45 +++++++++++++++++++++++++--------------------
 1 files changed, 25 insertions(+), 20 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -3986,11 +3986,13 @@ void rt_mutex_setprio(struct task_struct
 	oldprio = p-&gt;prio;
 	on_rq = p-&gt;se.on_rq;
 	running = task_running(rq, p);
-	if (on_rq) {
+	if (on_rq)
 		dequeue_task(rq, p, 0);
-		if (running)
-			p-&gt;sched_class-&gt;put_prev_task(rq, p);
-	}
+	/* current task is not kept in its runqueue in case of sched_fair class.
+	 * Hence we need the 'on_rq?' and 'running?' tests to be separate.
+	 */
+	if (running)
+		p-&gt;sched_class-&gt;put_prev_task(rq, p);
 
 	if (rt_prio(prio))
 		p-&gt;sched_class = &amp;rt_sched_class;
@@ -3999,9 +4001,9 @@ void rt_mutex_setprio(struct task_struct
 
 	p-&gt;prio = prio;
 
+	if (running)
+		p-&gt;sched_class-&gt;set_curr_task(rq);
 	if (on_rq) {
-		if (running)
-			p-&gt;sched_class-&gt;set_curr_task(rq);
 		enqueue_task(rq, p, 0);
 		inc_load(rq, p);
 		/*
@@ -4298,18 +4300,20 @@ recheck:
 	update_rq_clock(rq);
 	on_rq = p-&gt;se.on_rq;
 	running = task_running(rq, p);
-	if (on_rq) {
+	if (on_rq)
 		deactivate_task(rq, p, 0);
-		if (running)
-			p-&gt;sched_class-&gt;put_prev_task(r...
To: <vatsa@...>
Cc: <sukadev@...>, <balbir@...>, Containers <containers@...>, <ckrm-tech@...>, <linux-kernel@...>, <dhaval@...>, Ingo Molnar <mingo@...>, <efault@...>
Date: Friday, November 9, 2007 - 4:45 am

[Empty message]
To: Dmitry Adamushko <dmitry.adamushko@...>
Cc: <sukadev@...>, <balbir@...>, Containers <containers@...>, <ckrm-tech@...>, <linux-kernel@...>, <dhaval@...>, Ingo Molnar <mingo@...>, <efault@...>
Date: Friday, November 9, 2007 - 6:14 am

You are damned right! Sorry my mistake with the previous analysis and
(as I now find out) testing :(

There are couple of problems discovered by Suka's test:

- The test requires the cgroup filesystem to be mounted with
  atleast the cpu and ns options (i.e both namespace and cpu 
  controllers are active in the same hierarchy). 

	# mkdir /dev/cpuctl
	# mount -t cgroup -ocpu,ns none cpuctl
	(or simply)
	# mount -t cgroup none cpuctl -&gt; Will activate all controllers
					 in same hierarchy.

- The test invokes clone() with CLONE_NEWNS set. This causes a a new child
  to be created, also a new group (do_fork-&gt;copy_namespaces-&gt;ns_cgroup_clone-&gt;
  cgroup_clone) and the child is attached to the new group (cgroup_clone-&gt;
  attach_task-&gt;sched_move_task). At this point in time, the child's scheduler 
  related fields are uninitialized (including its on_rq field, which it has
  inherited from parent). As a result sched_move_task thinks its on
  runqueue, when it isn't.

  As a solution to this problem, I moved sched_fork() call, which
  initializes scheduler related fields on a new task, before
  copy_namespaces(). I am not sure though whether moving up will
  cause other side-effects. Do you see any issue?

- The second problem exposed by this test is that task_new_fair()
  assumes that parent and child will be part of the same group (which 
  needn't be as this test shows). As a result, cfs_rq-&gt;curr can be NULL
  for the child.

  The solution is to test for curr pointer being NULL in
  task_new_fair().


With the patch below, I could run ns_exec() fine w/o a crash.

Suka, can you verify whether this patch fixes your problem?


--

Fix copy_namespace() &lt;-&gt; sched_fork() dependency in do_fork, by moving
up sched_fork().

Also introduce a NULL pointer check for 'curr' in task_new_fair().

Signed-off-by : Srivatsa Vaddagiri &lt;vatsa@linux.vnet.ibm.com&gt;

---
 kernel/fork.c       |    6 +++---
 kernel/sched_fair.c |    2 +-
 2 files change...
To: Srivatsa Vaddagiri <vatsa@...>
Cc: Dmitry Adamushko <dmitry.adamushko@...>, <dhaval@...>, <ckrm-tech@...>, <efault@...>, <linux-kernel@...>, Containers <containers@...>, Ingo Molnar <mingo@...>
Date: Friday, November 9, 2007 - 12:05 pm

-
To: Serge E. Hallyn <serue@...>
Cc: Srivatsa Vaddagiri <vatsa@...>, <dhaval@...>, <ckrm-tech@...>, <efault@...>, <linux-kernel@...>, Dmitry Adamushko <dmitry.adamushko@...>, Containers <containers@...>, Ingo Molnar <mingo@...>
Date: Saturday, November 10, 2007 - 7:13 pm

Serge E. Hallyn [serue@us.ibm.com] wrote:
| Quoting Srivatsa Vaddagiri (vatsa@linux.vnet.ibm.com):
| &gt; On Fri, Nov 09, 2007 at 09:45:21AM +0100, Dmitry Adamushko wrote:
| &gt; &gt; Humm... the 'current' is not kept within the tree but
| &gt; &gt; current-&gt;se.on_rq is supposed to be '1' ,
| &gt; &gt; so the old code looks ok to me (at least for the 'leaf' elements).
| &gt; 
| &gt; You are damned right! Sorry my mistake with the previous analysis and
| &gt; (as I now find out) testing :(
| &gt; 
| &gt; There are couple of problems discovered by Suka's test:
| &gt; 
| &gt; - The test requires the cgroup filesystem to be mounted with
| &gt;   atleast the cpu and ns options (i.e both namespace and cpu 
| &gt;   controllers are active in the same hierarchy). 
| &gt; 
| &gt; 	# mkdir /dev/cpuctl
| &gt; 	# mount -t cgroup -ocpu,ns none cpuctl
| &gt; 	(or simply)
| &gt; 	# mount -t cgroup none cpuctl -&gt; Will activate all controllers
| &gt; 					 in same hierarchy.
| &gt; 
| &gt; - The test invokes clone() with CLONE_NEWNS set. This causes a a new child
| &gt;   to be created, also a new group (do_fork-&gt;copy_namespaces-&gt;ns_cgroup_clone-&gt;
| &gt;   cgroup_clone) and the child is attached to the new group (cgroup_clone-&gt;
| &gt;   attach_task-&gt;sched_move_task). At this point in time, the child's scheduler 
| &gt;   related fields are uninitialized (including its on_rq field, which it has
| &gt;   inherited from parent). As a result sched_move_task thinks its on
| &gt;   runqueue, when it isn't.
| &gt; 
| &gt;   As a solution to this problem, I moved sched_fork() call, which
| &gt;   initializes scheduler related fields on a new task, before
| &gt;   copy_namespaces(). I am not sure though whether moving up will
| &gt;   cause other side-effects. Do you see any issue?
| &gt; 
| &gt; - The second problem exposed by this test is that task_new_fair()
| &gt;   assumes that parent and child will be part of the same group (which 
| &gt;   needn't be as this test shows). As ...
To: <vatsa@...>
Cc: <sukadev@...>, <balbir@...>, Containers <containers@...>, <ckrm-tech@...>, <linux-kernel@...>, <dhaval@...>, Ingo Molnar <mingo@...>, <efault@...>
Date: Friday, November 9, 2007 - 6:59 am

Would it be better, logically-wise, to use is_same_group() instead?
Although, we can't have 2 groups with cfs_rq-&gt;curr != NULL on the same
CPU... so if the child belongs to another group, it's cfs_rq-&gt;curr is
automatically NULL indeed.



-- 
Best regards,
Dmitry Adamushko
-
To: Dmitry Adamushko <dmitry.adamushko@...>
Cc: <sukadev@...>, <balbir@...>, Containers <containers@...>, <ckrm-tech@...>, <linux-kernel@...>, <dhaval@...>, Ingo Molnar <mingo@...>, <efault@...>
Date: Friday, November 9, 2007 - 8:11 am

Yeah ..I feel safe with an explicit !curr check, perhaps with a comment like
below added to explain when curr can be NULL?


---
 kernel/sched_fair.c |    1 +
 1 files changed, 1 insertion(+)

Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -1022,6 +1022,7 @@ static void task_new_fair(struct rq *rq,
 	update_curr(cfs_rq);
 	place_entity(cfs_rq, se, 1);
 
+	/* 'curr' will be NULL if the child belongs to a different group */
 	if (sysctl_sched_child_runs_first &amp;&amp; this_cpu == task_cpu(p) &amp;&amp;
 			curr &amp;&amp; curr-&gt;vruntime &lt; se-&gt;vruntime) {
 		/*



-- 
Regards,
vatsa
-
To: Srivatsa Vaddagiri <vatsa@...>
Cc: Dmitry Adamushko <dmitry.adamushko@...>, <sukadev@...>, <balbir@...>, Containers <containers@...>, <ckrm-tech@...>, <linux-kernel@...>, <dhaval@...>, <efault@...>
Date: Friday, November 9, 2007 - 6:25 am

thanks, applied. I'll wait for confirmation from Suka before sending it 
to Linus.

	Ingo
-
Previous thread: [PATCH mm] unionfs: clear partial read by Hugh Dickins on Friday, November 9, 2007 - 3:31 am. (2 messages)

Next thread: 2.6.24-rc1 and 2.6.24.rc2 hangs while running udev on my laptop by SANGOI DINO LEONARDO on Friday, November 9, 2007 - 4:47 am. (4 messages)
speck-geostationary