Only to add that 2.6.25-rc1 is still broken.
thanks,
--alessandro
"We act as though comfort and luxury were the chief requirements
of life, when all that we need to make us really happy is
something to be enthusiastic about."(Charles Kingsley)
--
Yes, I think this is the same problem. Please try to unset CONFIG_GROUP_SCHED
Yes, it is.
Thanks,
Rafael
--
Ok, that's git ID's
b47711bfbcd4eb77ca61ef0162487b20e023ae55 2.6.24-git1
9b73e76f3cf63379dcf45fcd4f112f5812418d0a 2.6.24-git2
so if you get a git tree, you can do
gitk b47711b..9b73e76
to see what happened in there.
However, the obvious candidates are the scheduler or the ocfs2 merge, and
the latter is only relevant in case you use ocfs2, of course.

The rest of it tends to be the DVB and SCSI updates.
But it would be great if you could do a bisect and verify. Just do
git bisect start
git bisect good b47711bfbcd4eb77ca61ef0162487b20e023ae55
git bisect bad 9b73e76f3cf63379dcf45fcd4f112f5812418d0a
and off you go..
Linus
--
Well, I've already bisected that down to commit
6f505b16425a51270058e4a93441fe64de3dd435 "sched: rt group scheduling" and
provided a simple test case. Moreover, there are patches from Peter that fix
the problem, but they are lost somewhere on the way from him to you (please see
http://lkml.org/lkml/2008/2/5/535 and http://lkml.org/lkml/2008/2/6/320).

Thanks,
Rafael
--
no, they were not lost, they just didn't pass QA here (they crashed on a
particularly hard to debug 8-way box i have) and Peter worked on that
queue of fixes up until today to get it really correct. Could you check:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

combo patch below as well - whichever you prefer. The shortlog can be
found below as well - but i don't yet consider this pullable, i'd like
to see it pass a full night of randconfig tests on my test-systems.

this all was complicated by the fact that people found interactivity of
the group scheduler to be not up to their expectations (the observed
latencies would go up linearly with the number of UIDs in the system).
And your positive test results depended on the presence of those (at
that time, still half-baked) changes. That issue turned out to be due to
a .24-era design bug in the group scheduler and it took Peter a longer
time to straighten it out. (he flattened the rbtree to maintain perfect
weights and to get latencies back within the target.)

So what fixed your thing was simply not pullable into -rc1, as -rc1 was
being cooled down at around last Friday already. Sorry about that! The
short-term emergency fix would be to turn off the group scheduler
altogether. (or, if we are lucky, tomorrow i can send the pull request
for the changes below.)

Ingo
------------------>
Peter Zijlstra (15):
hrtimer: more hrtimer_init_sleeper() fallout.
sched: fair-group: separate tg->shares from task_group_lock
sched: fix incorrect irq lock usage in normalize_rt_tasks()
sched: rt-group: deal with PI
sched: rt-group: interface
sched: rt-group: make rt groups scheduling configurable
sched: rt-group: clean up the ifdeffery
sched: rt-group: refure unrunnable tasks
sched: rt-group: synchonised bandwidth period
sched: rt-group: smp balancing
sched: cleanup old and rarely used 'debug' features.
sched: fair-gro...
--
ok, we just found the reason for the 8-way crash, the delta fix from
Peter is below if any of you have tried the previous combo patch.
Updated sched.git as well, new HEAD is
fec13e45305d69fd0bd23b30bd05a0a42cf341f8.

Ingo

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -219,6 +219,10 @@ static void start_rt_bandwidth(struct rt
 	if (rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
+	if (hrtimer_active(&rt_b->rt_period_timer))
+		return;
+
+	spin_lock(&rt_b->rt_runtime_lock);
 	for (;;) {
 		if (hrtimer_active(&rt_b->rt_period_timer))
 			break;
@@ -229,6 +233,7 @@ static void start_rt_bandwidth(struct rt
 				rt_b->rt_period_timer.expires,
 				HRTIMER_MODE_ABS);
 	}
+	spin_unlock(&rt_b->rt_runtime_lock);
 }
 
 #ifdef CONFIG_RT_GROUP_SCHED
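
The delta above appears to follow a common pattern: a cheap unlocked check first, then a recheck under the lock before the timer is armed, so that concurrent callers cannot both arm it. A minimal user-space sketch of that pattern follows. It assumes a pthread mutex in place of the kernel spinlock and a plain atomic flag in place of hrtimer_active(); struct bandwidth, start_bandwidth() and race_start() are hypothetical names for illustration, not kernel code.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_THREADS 64

/* Hypothetical stand-in for struct rt_bandwidth; not kernel code. */
struct bandwidth {
	pthread_mutex_t lock;      /* plays the role of rt_runtime_lock */
	atomic_bool timer_active;  /* plays the role of hrtimer_active() */
	atomic_int arm_count;      /* how many times the timer was armed */
};

static void start_bandwidth(struct bandwidth *b)
{
	/* Fast path: skip the lock if the timer is already running. */
	if (atomic_load(&b->timer_active))
		return;

	pthread_mutex_lock(&b->lock);
	/* Recheck under the lock: another thread may have armed it meanwhile. */
	if (!atomic_load(&b->timer_active)) {
		atomic_fetch_add(&b->arm_count, 1);
		atomic_store(&b->timer_active, true);
	}
	pthread_mutex_unlock(&b->lock);
}

static void *worker(void *arg)
{
	start_bandwidth(arg);
	return NULL;
}

/* Let nthreads race to start the timer; return how often it was armed. */
int race_start(int nthreads)
{
	struct bandwidth b;
	pthread_t tids[MAX_THREADS];
	int i, started, armed;

	if (nthreads > MAX_THREADS)
		nthreads = MAX_THREADS;

	pthread_mutex_init(&b.lock, NULL);
	atomic_init(&b.timer_active, false);
	atomic_init(&b.arm_count, 0);

	for (started = 0; started < nthreads; started++)
		if (pthread_create(&tids[started], NULL, worker, &b) != 0)
			break;
	for (i = 0; i < started; i++)
		pthread_join(tids[i], NULL);

	armed = atomic_load(&b.arm_count);
	pthread_mutex_destroy(&b.lock);
	return armed;
}
```

Calling race_start(8) from a main() (built with -pthread) should report the timer armed exactly once, however the threads interleave. Without the recheck under the lock, two threads could both pass the unlocked check and arm it twice.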
--
With the previous patch and this patch applied, the issue is not reproducible
here.

Thanks,
--
Did you enable CONFIG_RT_GROUP_SCHED (it defaults to n)?
If you didn't, could you try with it set to y?
--
Tested with CONFIG_RT_GROUP_SCHED set and it also works as expected.
Thanks,
Rafael
--
I just rebuilt 2.6.25-rc1-git2 with Ingo's patch and your patch on top,
and the Oracle VKTM issue is still gone even with

[asuardi@sandman ~]$ grep GROUP_SCHED /share/src/linux-2.6.25-rc1-git2-orafix/.config
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
# CONFIG_CGROUP_SCHED is not set

so it's good for me.
Or is it necessary to also enable CONFIG_CGROUP_SCHED and retest ?
--alessandro
--
No that should be quite all-right. Thanks for testing!
--
The problem is fixed for me as well with the previous patch + the patch
below, VKTM now enters S state and Oracle shuts down properly again.

--alessandro
--
thanks a lot for testing this. The scheduler queue is now looking good in
testing, will probably send a pull request to Linus later today.

Ingo
--