With CONFIG_FAIR_CGROUP_SCHED=y, following commands on 2.6.24-rc1 crash
the system.
$ mount -t cgroup none /cgroups
$ ./ns_exec -cm /bin/ls
"ns_exec -cm" calls clone() to clone the mount namespace and then
executes the '/bin/ls' program in the cloned child.
Some observations that Serge and I made (we have been able to reproduce
reliably, but crash logs have not been very useful)
a. If we skip the 'mount' command, there is no crash.
b. If CONFIG_FAIR_CGROUP_SCHED=n again, there is no crash (even with
'mount' command).
c. mounting just the cpu or just the ns subsystem does not
lead to a crash. You can even
mount -t cgroup -o cpu none /mnt1
mount -t cgroup -o ns none /mnt2
and you still don't get a crash
d. mount -t cgroup -o cpu,ns none /cgroup
will always cause a crash with subsequent ns_exec
ns_exec.c and config file are attached. Let us know if you need more info.
Suka
---
Crash log:
Red Hat Enterprise Linux release 4.90 (Tikanga)
Kernel 2.6.24-rc1 on an x86_64
elm3a241 login: Unable to handle kernel NULL pointer dereferenceUnable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
at 0000000000000000 RIP:
[<0000000000000000>]
[<0000000000000000>]
PGD 102d4d067 PGD 102d4d067 PUD 102c88067 PUD 102c88067 PMD 0 PMD 0
Oops: 0000 [1] Oops: 0000 [1] SMP SMP
CPU 2 CPU 2
Modules linked in:Modules linked in:
Pid: 3273, comm: ns_exec Not tainted 2.6.24-rc1 #9
Pid: 3273, comm: ns_exec Not tainted 2.6.24-rc1 #9
RIP: 0010:[<0000000000000000>] RIP: 0010:[<0000000000000000>] [<0000000000000000>]
[<0000000000000000>]
RSP: 0018:ffff8101006a6af0 EFLAGS: 00055292
RSP: 0018:ffff8101006a6af0 EFLAGS: 00055292
RAX: 0000000000000000 RBX: ffff810100d11140 RCX: ffff810101de4000
RAX: 0000000000000000 RBX: ffff810100d11140 RCX: ffff810101de4000
RDX: 0000000000000000 RSI: ffff810100d1a880 RDI: ffff810001037c00
RDX: 0000000...Thanks for reporting the problem. It was caused because of the fact that
current task isn't kept in its runqueue in case of sched_fair class
tasks.
With the patch below, I could run ns_exec w/o any crash. Can you pls
verify it works for you as well?
Ingo,
Once Suka verifies that the patch fixes his crash, I would request you
to include the same in your tree and route it to Linus.
--
current task is not present in its runqueue in case of sched_fair class
tasks. Take care of this fact in rt_mutex_setprio(),
sched_setscheduler() and sched_move_task() routines.
Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
kernel/sched.c | 45 +++++++++++++++++++++++++--------------------
1 files changed, 25 insertions(+), 20 deletions(-)
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -3986,11 +3986,13 @@ void rt_mutex_setprio(struct task_struct
oldprio = p->prio;
on_rq = p->se.on_rq;
running = task_running(rq, p);
- if (on_rq) {
+ if (on_rq)
dequeue_task(rq, p, 0);
- if (running)
- p->sched_class->put_prev_task(rq, p);
- }
+ /* current task is not kept in its runqueue in case of sched_fair class.
+ * Hence we need the 'on_rq?' and 'running?' tests to be separate.
+ */
+ if (running)
+ p->sched_class->put_prev_task(rq, p);
if (rt_prio(prio))
p->sched_class = &rt_sched_class;
@@ -3999,9 +4001,9 @@ void rt_mutex_setprio(struct task_struct
p->prio = prio;
+ if (running)
+ p->sched_class->set_curr_task(rq);
if (on_rq) {
- if (running)
- p->sched_class->set_curr_task(rq);
enqueue_task(rq, p, 0);
inc_load(rq, p);
/*
@@ -4298,18 +4300,20 @@ recheck:
update_rq_clock(rq);
on_rq = p->se.on_rq;
running = task_running(rq, p);
- if (on_rq) {
+ if (on_rq)
deactivate_task(rq, p, 0);
- if (running)
- p->sched_class->put_prev_task(r...You are damned right! Sorry my mistake with the previous analysis and (as I now find out) testing :( There are couple of problems discovered by Suka's test: - The test requires the cgroup filesystem to be mounted with atleast the cpu and ns options (i.e both namespace and cpu controllers are active in the same hierarchy). # mkdir /dev/cpuctl # mount -t cgroup -ocpu,ns none cpuctl (or simply) # mount -t cgroup none cpuctl -> Will activate all controllers in same hierarchy. - The test invokes clone() with CLONE_NEWNS set. This causes a a new child to be created, also a new group (do_fork->copy_namespaces->ns_cgroup_clone-> cgroup_clone) and the child is attached to the new group (cgroup_clone-> attach_task->sched_move_task). At this point in time, the child's scheduler related fields are uninitialized (including its on_rq field, which it has inherited from parent). As a result sched_move_task thinks its on runqueue, when it isn't. As a solution to this problem, I moved sched_fork() call, which initializes scheduler related fields on a new task, before copy_namespaces(). I am not sure though whether moving up will cause other side-effects. Do you see any issue? - The second problem exposed by this test is that task_new_fair() assumes that parent and child will be part of the same group (which needn't be as this test shows). As a result, cfs_rq->curr can be NULL for the child. The solution is to test for curr pointer being NULL in task_new_fair(). With the patch below, I could run ns_exec() fine w/o a crash. Suka, can you verify whether this patch fixes your problem? -- Fix copy_namespace() <-> sched_fork() dependency in do_fork, by moving up sched_fork(). Also introduce a NULL pointer check for 'curr' in task_new_fair(). Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> --- kernel/fork.c | 6 +++--- kernel/sched_fair.c | 2 +- 2 files change...
-
Serge E. Hallyn [serue@us.ibm.com] wrote: | Quoting Srivatsa Vaddagiri (vatsa@linux.vnet.ibm.com): | > On Fri, Nov 09, 2007 at 09:45:21AM +0100, Dmitry Adamushko wrote: | > > Humm... the 'current' is not kept within the tree but | > > current->se.on_rq is supposed to be '1' , | > > so the old code looks ok to me (at least for the 'leaf' elements). | > | > You are damned right! Sorry my mistake with the previous analysis and | > (as I now find out) testing :( | > | > There are couple of problems discovered by Suka's test: | > | > - The test requires the cgroup filesystem to be mounted with | > atleast the cpu and ns options (i.e both namespace and cpu | > controllers are active in the same hierarchy). | > | > # mkdir /dev/cpuctl | > # mount -t cgroup -ocpu,ns none cpuctl | > (or simply) | > # mount -t cgroup none cpuctl -> Will activate all controllers | > in same hierarchy. | > | > - The test invokes clone() with CLONE_NEWNS set. This causes a a new child | > to be created, also a new group (do_fork->copy_namespaces->ns_cgroup_clone-> | > cgroup_clone) and the child is attached to the new group (cgroup_clone-> | > attach_task->sched_move_task). At this point in time, the child's scheduler | > related fields are uninitialized (including its on_rq field, which it has | > inherited from parent). As a result sched_move_task thinks its on | > runqueue, when it isn't. | > | > As a solution to this problem, I moved sched_fork() call, which | > initializes scheduler related fields on a new task, before | > copy_namespaces(). I am not sure though whether moving up will | > cause other side-effects. Do you see any issue? | > | > - The second problem exposed by this test is that task_new_fair() | > assumes that parent and child will be part of the same group (which | > needn't be as this test shows). As ...
Would it be better, logically-wise, to use is_same_group() instead? Although, we can't have 2 groups with cfs_rq->curr != NULL on the same CPU... so if the child belongs to another group, it's cfs_rq->curr is automatically NULL indeed. -- Best regards, Dmitry Adamushko -
Yeah ..I feel safe with an explicit !curr check, perhaps with a comment like
below added to explain when curr can be NULL?
---
kernel/sched_fair.c | 1 +
1 files changed, 1 insertion(+)
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -1022,6 +1022,7 @@ static void task_new_fair(struct rq *rq,
update_curr(cfs_rq);
place_entity(cfs_rq, se, 1);
+ /* 'curr' will be NULL if the child belongs to a different group */
if (sysctl_sched_child_runs_first && this_cpu == task_cpu(p) &&
curr && curr->vruntime < se->vruntime) {
/*
--
Regards,
vatsa
-thanks, applied. I'll wait for confirmation from Suka before sending it to Linus. Ingo -
| Chuck Ebbert | Why do so many machines need "noapic"? |
| Renato S. Yamane | Error -71 on device descriptor read/all |
| Greg Kroah-Hartman | [PATCH 05/54] kset: convert fuse to use kset_create |
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
git: | |
| R. Tyler Ballance | Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted ... |
| Shawn O. Pearce | Re: Some ideas for StGIT |
| Alexander Litvinov | git-svn does not seems to work with crlf convertion enabled. |
| Wink Saville | Resolving conflicts |
| John P Poet | Realtek 8111C transmit timed out |
| Rémi Denis-Courmont | Re: [PATCH] Security: Implement and document RLIMIT_NETWORK. |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| David Miller | [GIT]: Networking |
| Jason Beaudoin | Re: Real men don't attack straw men |
| Parvinder Bhasin | BIND and CNAME-ing |
| Manuel Ravasio | Annoying problem with dnsmasq |
| Craig Skinner | Re: How can i boot a bsd.rd from windows 2000 ? |
