Re: Soft lockup regression from today's sched.git merge.

Previous thread: [PATCH 0/11] Generic smp_call_function() and friends by Jens Axboe on Tuesday, April 22, 2008 - 3:57 am. (67 messages)

Next thread: WARNING: at arch/x86/kernel/genapic_64.c:86 by Zdenek Kabelac on Tuesday, April 22, 2008 - 5:04 am. (6 messages)
To: <mingo@...>
Cc: <linux-kernel@...>
Date: Tuesday, April 22, 2008 - 4:59 am

The following commit:

commit 27ec4407790d075c325e1f4da0a19c56953cce23
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu Feb 28 21:00:21 2008 +0100

    sched: make cpu_clock() globally synchronous
    
    Alexey Zaytsev reported (and bisected) that the introduction of
    cpu_clock() in printk made the timestamps jump back and forth.
    
    Make cpu_clock() more reliable while still keeping it fast when it's
    called frequently.
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

causes watchdog triggers when a cpu exits NOHZ state when it has been
there for >= the soft lockup threshold, for example here are some
messages from a 128 cpu Niagara2 box:

[  168.106406] BUG: soft lockup - CPU#11 stuck for 128s! [dd:3239]
[  168.989592] BUG: soft lockup - CPU#21 stuck for 86s! [swapper:0]
[  168.999587] BUG: soft lockup - CPU#29 stuck for 91s! [make:4511]
[  168.999615] BUG: soft lockup - CPU#2 stuck for 85s! [swapper:0]
[  169.020514] BUG: soft lockup - CPU#37 stuck for 91s! [swapper:0]
[  169.020514] BUG: soft lockup - CPU#45 stuck for 91s! [sh:4515]
[  169.020515] BUG: soft lockup - CPU#69 stuck for 92s! [swapper:0]
[  169.020515] BUG: soft lockup - CPU#77 stuck for 92s! [swapper:0]
[  169.020515] BUG: soft lockup - CPU#61 stuck for 92s! [swapper:0]
[  169.112554] BUG: soft lockup - CPU#85 stuck for 92s! [swapper:0]
[  169.112554] BUG: soft lockup - CPU#101 stuck for 92s! [swapper:0]
[  169.112554] BUG: soft lockup - CPU#109 stuck for 92s! [swapper:0]
[  169.112554] BUG: soft lockup - CPU#117 stuck for 92s! [swapper:0]
[  169.171483] BUG: soft lockup - CPU#40 stuck for 80s! [dd:3239]
[  169.331483] BUG: soft lockup - CPU#13 stuck for 86s! [swapper:0]
[  169.351500] BUG: soft lockup - CPU#43 stuck for 101s! [dd:3239]
[  169.531482] BUG: soft lockup - CPU#9 stuck for 129s! [mkdir:4565]
[  169.595754] BUG: soft lockup - CPU#20 stuck for 93s! [swapper:0]
[  169.626787] BUG: soft lockup - CPU#52 stuck for 93s! [swapper:0]
[  169.626787] BUG: soft lockup - CPU#...
To: David Miller <davem@...>
Cc: <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 5:14 am

thanks for reporting it. I haven't seen this false positive happen in a 
long time - but then again, PC CPUs are a lot less idle than a 128-CPU 
Niagara2 :-/ I'm wondering what the best method would be to provoke a 
CPU to stay idle that long - to make sure this bug is fixed.

so i only have the untested patch below for now - does it fix the bug 
for you?

	Ingo

----------------------------------->
Subject: softlockup: fix NOHZ wakeup
From: Ingo Molnar <mingo@elte.hu>

David Miller reported:

|--------------->
the following commit:

| commit 27ec4407790d075c325e1f4da0a19c56953cce23
| Author: Ingo Molnar <mingo@elte.hu>
| Date:   Thu Feb 28 21:00:21 2008 +0100
|
|     sched: make cpu_clock() globally synchronous
|
|     Alexey Zaytsev reported (and bisected) that the introduction of
|     cpu_clock() in printk made the timestamps jump back and forth.
|
|     Make cpu_clock() more reliable while still keeping it fast when it's
|     called frequently.
|
|     Signed-off-by: Ingo Molnar <mingo@elte.hu>

causes watchdog triggers when a cpu exits NOHZ state when it has been
there for >= the soft lockup threshold, for example here are some
messages from a 128 cpu Niagara2 box:

[  168.106406] BUG: soft lockup - CPU#11 stuck for 128s! [dd:3239]
[  168.989592] BUG: soft lockup - CPU#21 stuck for 86s! [swapper:0]
[  168.999587] BUG: soft lockup - CPU#29 stuck for 91s! [make:4511]
[  168.999615] BUG: soft lockup - CPU#2 stuck for 85s! [swapper:0]
[  169.020514] BUG: soft lockup - CPU#37 stuck for 91s! [swapper:0]
[  169.020514] BUG: soft lockup - CPU#45 stuck for 91s! [sh:4515]
[  169.020515] BUG: soft lockup - CPU#69 stuck for 92s! [swapper:0]
[  169.020515] BUG: soft lockup - CPU#77 stuck for 92s! [swapper:0]
[  169.020515] BUG: soft lockup - CPU#61 stuck for 92s! [swapper:0]
[  169.112554] BUG: soft lockup - CPU#85 stuck for 92s! [swapper:0]
[  169.112554] BUG: soft lockup - CPU#101 stuck for 92s! [swapper:0]
[  169.112554] BUG: soft lockup...
To: <mingo@...>
Cc: <linux-kernel@...>, <tglx@...>
Date: Wednesday, April 23, 2008 - 1:42 am

From: Ingo Molnar <mingo@elte.hu>

I looked more closely at this.

There is no way the patch in question can work properly.

The algorithm is, essentially, "if time - prev_cpu_time is large enough,
call __sync_cpu_clock()", which is fine, except that nothing ever sets
prev_cpu_time.

The code is fatally flawed, once __sync_cpu_clock() calls start
happening, they will happen on every cpu_clock() call.
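
The effect described above can be reproduced in a small userspace model.
This is only a sketch: the struct, the model_* helpers and count_syncs()
are hypothetical stand-ins, not kernel code, and the 100000 ns threshold
is the figure quoted in this thread.

```c
#include <stdint.h>

/* Userspace model only: names mirror the thread, bodies are stand-ins. */
#define TIME_SYNC_THRESH 100000ULL	/* ns, the value quoted in the thread */

struct clock_model {
	uint64_t prev_cpu_time;	/* models per_cpu(prev_cpu_time, cpu) */
	unsigned long syncs;	/* how often the slow path ran */
	int update_prev;	/* 0 = prev_cpu_time never stored, 1 = stored */
};

/* Stand-in for __sync_cpu_clock(): only counts the call. */
static uint64_t model_sync(struct clock_model *m, uint64_t time)
{
	m->syncs++;
	return time;
}

static uint64_t model_cpu_clock(struct clock_model *m, uint64_t now)
{
	uint64_t time = now;
	uint64_t delta_time = time - m->prev_cpu_time;

	if (delta_time > TIME_SYNC_THRESH)
		time = model_sync(m, time);

	if (m->update_prev)
		m->prev_cpu_time = time;

	return time;
}

/* Call the model once per microsecond and return the number of syncs. */
static unsigned long count_syncs(int update_prev, unsigned long calls)
{
	struct clock_model m = { 0, 0, update_prev };
	unsigned long i;

	for (i = 1; i <= calls; i++)
		model_cpu_clock(&m, 1000000ULL + i * 1000ULL);

	return m.syncs;
}
```

With update_prev set to 0 the slow path runs on every call once the clock
has passed the threshold; with it set to 1 only the first call syncs in
this model.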

So like my bisect showed from the get-go, these cpu_clock() changes
have major problems, so it was quite a mind boggling stretch to stick
a touch_softlockup_watchdog() call somewhere to try and fix this.

Furthermore, this is an extremely expensive way to ensure monotonic
per-rq timestamps.  A global spinlock taken every 100000 ns on every
cpu?!?!  :-/

At least move any implication of "high speed" from the comments above
cpu_clock() if we're going to need something like this.  I have 128
cpus, that's 128 grabs of that spinlock every quantum.  My next system
I'm getting will have 256 cpus.  The expense of your solution
increases linearly with the number of cpus, which doesn't scale.
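
A rough bound on that cost can be worked out as follows. This is a sketch
under one assumption: each CPU takes the global lock at most once per
threshold interval, which is what the code intends. The helper name is
hypothetical.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Upper bound on global lock acquisitions per second, assuming every
 * CPU takes the lock at most once per thresh_ns nanoseconds.
 */
static uint64_t max_lock_grabs_per_sec(uint64_t ncpus, uint64_t thresh_ns)
{
	return ncpus * (NSEC_PER_SEC / thresh_ns);
}
```

With a 100000 ns threshold this gives 10000 acquisitions per second per
CPU, so the bound doubles when going from 128 to 256 cpus.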

Anyways, I'll work on the group sched lockup bug next.  As if I have
nothing better to do during the merge window than fix sched tree
regressions :-(
--
To: David Miller <davem@...>
Cc: <linux-kernel@...>, <tglx@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, April 23, 2008 - 5:40 am

we are glad about your feedback and we will fix any and all bugs in this 
code, as fast as we can. Let me also defend the code as there are two 
factors that you might not be aware of:

Firstly, cpu_clock() is only used in debug code, not in a fastpath. 
Using a global lock there was a conscious choice, it's been in -mm for 
two months and in linux-next as well for some time. I booted it on a 
64-way box with nohz enabled and it wasn't a visible problem in 
benchmarks.

The time_sync_thresh thing was indeed an afterthought and it was indeed 
buggy but harmless to functionality. (that's why it went unnoticed - 
until you found the thinko - thanks for that.) The patches i sent today 
should largely address that.

Secondly, and perhaps more importantly, the nohz code _already_ uses a 
very heavy lock in the idle path and always did that! It is exactly the 
kind of global lock you complain about above, just much worse: it's used 
in the fastpath.

Every time irq_enter() is done on an idle CPU we do:

|  static void tick_do_update_jiffies64(ktime_t now)
|  {
|          unsigned long ticks = 0;
|          ktime_t delta;
|
|          /* Reevalute with xtime_lock held */
|          write_seqlock(&xtime_lock);

... that xtime_lock is a heavy global seqlock - just as heavy as any 
global spinlock. That's the price we pay for global jiffies and shoddy 
drivers that rely on it.

This has been there in the fastpath unconditionally (not in debug code) 
and it was not reported as a problem before. The reason is that while 
this global lock truly sucks it spends time from an _idle_ CPU's time 
which mutes some of its overhead and makes it mostly "invisible" to any 
real performance measurement.

It's still not ideal because it slightly increases irq latency. So it 
would be nice to improve upon it but it's nothing new and it's not 
caused by these commits. Any ideas how to do it? I guess we could first 
check jiffies unlocked - this would mute much of the polling that goes 
on h...
To: David Miller <davem@...>
Cc: <linux-kernel@...>, <tglx@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, April 23, 2008 - 3:51 am

you are right - and this causes us to hit that global spinlock every 
time cpu_clock() is called. Note that only debugging code uses 
cpu_clock() though (softlockup watchdog, blktrace, etc.) - but you are 
right that the performance bug should be fixed - the patch below fixes 
the bogosity.

the second patch below makes time_sync_thresh available to architecture 
code - that way, if your platform has a guaranteed-synchronous 
sched_clock(), you can set that to a larger value (or just -1LL for 
infinite).
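
Why -1LL can act as "infinite" here: a minimal sketch, assuming both the
delta and the threshold are unsigned long long (the types are an
assumption made for illustration, and needs_sync() is a hypothetical
helper, not a kernel function).

```c
/*
 * Models the "delta_time > time_sync_thresh" test from cpu_clock().
 * Passing -1LL for an unsigned long long parameter converts it to the
 * largest representable value, so no delta can ever exceed it.
 */
static int needs_sync(unsigned long long delta_time,
		      unsigned long long time_sync_thresh)
{
	return delta_time > time_sync_thresh;
}
```

needs_sync(delta, -1LL) is 0 for every delta, so an architecture with a
synchronous sched_clock() would never enter the slow path.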

this problem cannot explain the softlockup bug though.

	Ingo

---------------------------->
Subject: sched: fix cpu clock
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Apr 23 09:24:06 CEST 2008

David Miller pointed out that nothing in cpu_clock() sets
prev_cpu_time. This caused __sync_cpu_clock() to be called
all the time - against the intention of this code.

The result was that in practice we hit a global spinlock every
time cpu_clock() is called - which - even though cpu_clock()
is used for tracing and debugging, is suboptimal.

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1001,6 +1001,8 @@ unsigned long long notrace cpu_clock(int
 	if (unlikely(delta_time > time_sync_thresh))
 		time = __sync_cpu_clock(time, cpu);
 
+	per_cpu(prev_cpu_time, cpu) = time;
+
 	return time;
 }
 EXPORT_SYMBOL_GPL(cpu_clock);

Subject: sched: make clock sync tunable by architecture code
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Apr 23 09:31:35 CEST 2008

make time_sync_thresh tunable to architecture code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    2 ++
 kernel/sched.c        |    2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

I...
To: David Miller <davem@...>
Cc: <mingo@...>, <linux-kernel@...>, <tglx@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>, Aneesh Kumar KV <aneesh.kumar@...>
Date: Wednesday, April 23, 2008 - 3:32 am

I've been able to reproduce this on a 128way ppc64 box here. Let me see
if I can get you a fix soon enough.

-- 
regards,
Dhaval
--
To: <mingo@...>
Cc: <linux-kernel@...>, <tglx@...>
Date: Tuesday, April 22, 2008 - 6:05 am

From: Ingo Molnar <mingo@elte.hu>

The NOHZ lockup warnings are gone.  But this seems like
a band-aid.  We made sure that cpus don't get into this
state via commit:

----------------------------------------
commit d3938204468dccae16be0099a2abf53db4ed0505
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Wed Nov 28 15:52:56 2007 +0100

    softlockup: fix false positives on CONFIG_NOHZ
    
    David Miller reported soft lockup false-positives that trigger
    on NOHZ due to CPUs idling for more than 10 seconds.
    
    The solution is touch the softlockup watchdog when we return from
    idle. (by definition we are not 'locked up' when we were idle)
    
     http://bugzilla.kernel.org/show_bug.cgi?id=9409
    
    Reported-by: David Miller <davem@davemloft.net>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 27a2338..cb89fa8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -133,6 +133,8 @@ void tick_nohz_update_jiffies(void)
 	if (!ts->tick_stopped)
 		return;
 
+	touch_softlockup_watchdog();
+
 	cpu_clear(cpu, nohz_cpu_mask);
 	now = ktime_get();
----------------------------------------

While what the guilty patch we're discussing here does is change how
cpu_clock() is computed, that's it.  softlockup uses cpu_clock() to
calculate its timestamp.  The guilty change modified nothing about
when touch_softlockup_watchdog() is called, nor any other aspect about
how the softlockup mechanism itself works.

So we need to figure out why in the world changing how cpu_clock()
gets calculated makes a difference.

Anyways, this is with HZ=1000 FWIW.  And I really don't feel this is a
128-cpu monster system thing, I bet my 2-cpu workstation triggers this
too, and I'll make sure of that for you.

BTW, I'm also getting cpu's wedged in the group aggregate code:

[  121.338742] TSTATE: 0000...
To: David Miller <davem@...>
Cc: <linux-kernel@...>, <tglx@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, April 23, 2008 - 4:50 am

so the patch i gave you first should do the trick - the cleaner patch is 
below.

the thing is that that change _fixed_ cpu_clock() - and exposed the 
latent softlockup/nohz interaction that was always there and which 
caused false positives if a CPU stayed idle for more than 60 seconds.
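
The ordering problem described here can be modelled in userspace. This is
a sketch of the explanation given in this mail, not of the real watchdog:
the struct, the model_* helpers and wake_from_idle() are hypothetical, and
only the 60 second threshold comes from the thread.

```c
#include <stdint.h>

#define SOFTLOCKUP_THRESH_SEC 60ULL	/* threshold cited in the thread */

struct wd_model {
	uint64_t clock_sec;	/* models the jiffies-driven cpu clock */
	uint64_t touch_sec;	/* models the per-CPU watchdog timestamp */
};

/* Models touch_softlockup_watchdog(): stamp the current clock value. */
static void model_touch(struct wd_model *w)
{
	w->touch_sec = w->clock_sec;
}

/* Models the jiffies catch-up after idle: the clock jumps forward. */
static void model_update_jiffies(struct wd_model *w, uint64_t idle_sec)
{
	w->clock_sec += idle_sec;
}

static int model_lockup_reported(const struct wd_model *w)
{
	return w->clock_sec - w->touch_sec > SOFTLOCKUP_THRESH_SEC;
}

/* Wake from idle_sec seconds of idle; returns 1 on a false positive. */
static int wake_from_idle(uint64_t idle_sec, int touch_after_update)
{
	struct wd_model w = { 100, 100 };

	if (!touch_after_update)
		model_touch(&w);
	model_update_jiffies(&w, idle_sec);
	if (touch_after_update)
		model_touch(&w);

	return model_lockup_reported(&w);
}
```

In this model a 90 second idle period reports a lockup when the touch
happens before the jiffies update and does not when it happens after.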

	Ingo

------------------>
Subject: softlockup: fix false positives on nohz if CPU is 100% idle for more than 60 seconds
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Apr 23 10:01:08 CEST 2008

David Miller reported softlockup false positives on his 128-way Niagara2 
system. The special thing about that system is extremely long clockevent 
timeouts combined with extremely long idle times.

Fix rq->clock update bug that can trigger on such a system: in 
tick_nohz_update_jiffies() [which is called on all irq entry on all cpus 
where the irq entry hits an idle cpu] we call 
touch_softlockup_watchdog() before we update jiffies. That works fine 
most of the time when idle timeouts are within 60 seconds. But when an 
idle timeout is beyond 60 seconds, jiffies is updated with a jump of 
more than 60 seconds, which causes a jump in cpu-clock of more than 60 
seconds, triggering a false positive.

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/time/tick-sched.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c
+++ linux/kernel/time/tick-sched.c
@@ -133,8 +133,6 @@ void tick_nohz_update_jiffies(void)
 	if (!ts->tick_stopped)
 		return;
 
-	touch_softlockup_watchdog();
-
 	cpu_clear(cpu, nohz_cpu_mask);
 	now = ktime_get();
 	ts->idle_waketime = now;
@@ -142,6 +140,8 @@ void tick_nohz_update_jiffies(void)
 	local_irq_save(flags);
 	tick_do_update_jiffies64(now);
 	local_irq_restore(flags);
+
+	touch_softlockup_watchdog();
 }
 
 void tick_nohz_stop_id...
To: <mingo@...>
Cc: <linux-kernel@...>, <tglx@...>, <a.p.zijlstra@...>
Date: Wednesday, April 23, 2008 - 6:55 am

From: Ingo Molnar <mingo@elte.hu>

Even with this patch applied and the second buggy changeset:

commit 15934a37324f32e0fda633dc7984a671ea81cd75
Author: Guillaume Chazarain <guichaz@yahoo.fr>
Date:   Sat Apr 19 19:44:57 2008 +0200

    sched: fix rq->clock overflows detection with CONFIG_NO_HZ
    
    When using CONFIG_NO_HZ, rq->tick_timestamp is not updated every TICK_NSEC.
    We check that the number of skipped ticks matches the clock jump seen in
    __update_rq_clock().
    
    Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

reverted, I'm still getting softlockup warnings and wedged cpus, see
the log below.

Look, I'm just going to walk through the scheduler and cpumask
changesets one by one and report as each regression appears
and try to continue with the guilty changesets reported.  There
are apparently several bugs being added by these changesets.

It may take some time, as each test run to verify the existence
of the problem takes several minutes.

[  111.877610] TSTATE: 0000000011001601 TPC: 000000000042782c TNPC: 0000000000427830 Y: 00000000    Not tainted
[  111.877610] TPC: <cpu_idle+0xb0/0x134>
[  111.877610] g0: 0000000000000000 g1: 0000000000000016 g2: fffff803ff678000 g3: 0000000000000000
[  111.877610] g4: fffff803ff647100 g5: fffff80007e4a000 g6: fffff803ff678000 g7: 0000021100004000
[  111.877610] o0: 0000000000000000 o1: fffff803ff678008 o2: 0000000000004000 o3: 0000000000000001
[  111.877610] o4: fffff8000863a2d0 o5: 0000000000000012 sp: fffff803ff67b681 ret_pc: 0000000000427818
[  111.877610] RPC: <cpu_idle+0x9c/0x134>
[  111.877610] l0: 000000000000003d l1: 00000000007bb6d0 l2: 000000000000003d l3: 00000000000000c8
[  111.877610] l4: 00000000feb51cc7 l5: 00000000feb51c7f l6: 0000000000000001 l7: 0000000000000000
[  111.877610] i0: 0000000000000000 i1: 0000000000000016 i2: 0000000000001000 i3: 00000000f025bdfc
[  111.877610] i4: 00000000fee81b18 i5: 00000...
To: <mingo@...>
Cc: <linux-kernel@...>, <tglx@...>, <a.p.zijlstra@...>
Date: Wednesday, April 23, 2008 - 8:29 am

From: David Miller <davem@davemloft.net>

Ok, Ingo, none of your patches fix even the initial buggy
changeset, for reference:

commit 27ec4407790d075c325e1f4da0a19c56953cce23
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu Feb 28 21:00:21 2008 +0100

    sched: make cpu_clock() globally synchronous
    
    Alexey Zaytsev reported (and bisected) that the introduction of
    cpu_clock() in printk made the timestamps jump back and forth.
    
    Make cpu_clock() more reliable while still keeping it fast when it's
    called frequently.
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

I checked out a tree to the changeset before this one, just
to double check, and there are no problems.

I add that changeset and I get softlockup warnings like crazy
in my logs.

I added your "move touch_softlockup_watchdog() later in
tick_nohz_update_jiffies()" patch:

--------------------
Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c
+++ linux/kernel/time/tick-sched.c
@@ -133,8 +133,6 @@ void tick_nohz_update_jiffies(void)
 	if (!ts->tick_stopped)
 		return;
 
-	touch_softlockup_watchdog();
-
 	cpu_clear(cpu, nohz_cpu_mask);
 	now = ktime_get();
 	ts->idle_waketime = now;
@@ -142,6 +140,8 @@ void tick_nohz_update_jiffies(void)
 	local_irq_save(flags);
 	tick_do_update_jiffies64(now);
 	local_irq_restore(flags);
+
+	touch_softlockup_watchdog();
 }
 
 void tick_nohz_stop_idle(int cpu)
--------------------

and still I get mountains of softlockup messages, see first
attachment, below.

I then added your patch, just to make sure, which adds the
missing prev_cpu_time assignment, specifically:

--------------------
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1001,6 +1001,8 @@ unsigned long long notrace cpu_clock(int
 	if (unlikely(delta_time > time_sync_thresh))
 		time = __sync_cpu_clock(time, cpu);
 
+	per_cpu(prev_cpu_time, ...
To: David Miller <davem@...>
Cc: <linux-kernel@...>, <tglx@...>, <a.p.zijlstra@...>
Date: Wednesday, April 23, 2008 - 9:36 am

as a temporary workaround please try the patch below, until we can 
reproduce and fix the bug.

	Ingo

------------------------------->
Subject: softlockup: nohz workaround
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Apr 23 15:19:32 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/softlockup.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/kernel/softlockup.c
===================================================================
--- linux.orig/kernel/softlockup.c
+++ linux/kernel/softlockup.c
@@ -245,7 +245,7 @@ static int watchdog(void *__bind_cpu)
 	 */
 	while (!kthread_should_stop()) {
 		touch_softlockup_watchdog();
-		schedule();
+		schedule_timeout(softlockup_thresh/2);
 
 		if (kthread_should_stop())
 			break;
--
To: <mingo@...>
Cc: <linux-kernel@...>, <tglx@...>, <a.p.zijlstra@...>
Date: Wednesday, April 23, 2008 - 7:23 pm

From: Ingo Molnar <mingo@elte.hu>

Yeah, if you basically turn off the code paths, that particular set of
problems goes away :-/

So then we're at the next bug, cpus getting wedged in the group
aggregate code.

I'll try Peter's patches which were posted today.

[  760.218048] BUG: soft lockup - CPU#5 stuck for 61s! [swapper:0]
[  760.218292] TSTATE: 0000000080001603 TPC: 000000000054e0c0 TNPC: 000000000054e0c4 Y: 00000000    Not tainted
[  760.218325] TPC: <find_next_bit+0xe4/0x11c>
[  760.218336] g0: 0000000000009000 g1: 0000000000000000 g2: ffffffffffffffff g3: 0000000000000030
[  760.218352] g4: fffff803ff0d5880 g5: fffff80007c8a000 g6: fffff803ff0ec000 g7: 00000000007bb6d0
[  760.218368] o0: 000000000000fff0 o1: 0000000000000040 o2: 0000000000000034 o3: 0000000000000000
[  760.218383] o4: 0000000100009332 o5: 0000000000000000 sp: fffff803ff0eee21 ret_pc: 000000000054de08
[  760.218402] RPC: <__next_cpu+0x18/0x2c>
[  760.218413] l0: 00000000007f0000 l1: 0000009980001602 l2: 0000000000455d2c l3: 0000000000000400
[  760.218428] l4: 0000000000000000 l5: 0000000000000002 l6: 0000000000000000 l7: 0000000000000008
[  760.218443] i0: 0000000000000033 i1: 00000000007bb6c8 i2: 0000000000000038 i3: fffff803f73bf100
[  760.218459] i4: 0000000000845000 i5: 0000000000000401 i6: fffff803ff0eeee1 i7: 0000000000455d48
[  760.218487] I7: <aggregate_group_shares+0x10c/0x16c>
[  823.716459] INFO: task collect2:4106 blocked for more than 120 seconds.
[  823.716680] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  823.716815] collect2      D 00000000006b4a80     0  4106   4105
[  823.716831] Call Trace:
[  823.716839]  [00000000006b4c40] schedule_timeout+0x20/0xa4
[  823.716859]  [00000000006b4a80] wait_for_common+0xf4/0x184
[  823.716875]  [000000000045f2cc] do_fork+0x1dc/0x234
[  823.716894]  [0000000000406214] linux_sparc_syscall32+0x3c/0x40
[  823.716917]  [0000000000023f50] 0x23f58

--
To: David Miller <davem@...>
Cc: <mingo@...>, <linux-kernel@...>, <tglx@...>
Date: Tuesday, April 22, 2008 - 8:45 am

Sadly both the cpumask changes and the group load-balancer came in the
same batch - so it's all new code. Also, it looks like the cpumask
changes are before the load balance changes - so bisecting this will be
'fun'.

That said; the code in question looks like:

static
void aggregate_group_shares(struct task_group *tg, struct sched_domain *sd)
{
        unsigned long shares = 0;
        int i;

again:
        for_each_cpu_mask(i, sd->span)
                shares += tg->cfs_rq[i]->shares;

        /*
         * When the span doesn't have any shares assigned, but does have
         * tasks to run do a machine wide rebalance (should be rare).
         */
        if (unlikely(!shares && aggregate(tg, sd)->rq_weight)) {
                __aggregate_redistribute_shares(tg);
                goto again;
        }

        aggregate(tg, sd)->shares = shares;
}

and the __first_cpu() call is from the for_each_cpu_mask() afaict. and
sd->span should be good - that's not new. So I'm a bit puzzled.

But you say they get wedged - so the above trace is a snapshot (NMI,
sysrq-[tw] or the like) that could also mean they get wedged in this
'again' loop we have here.

I have two patches; the first will stick in a printk() to see if it is
indeed the loop in this function. The second is an attempt at breaking
out of it.
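
To illustrate how the 'again' loop could spin forever and what a bounded
variant might look like, here is a userspace sketch. It is not either of
the two patches mentioned above: MAX_PASSES, the redistribute_* stand-ins,
aggregate_bounded() and demo_span01() are all hypothetical.

```c
#include <stddef.h>

#define NR_CPUS_MODEL 4
#define MAX_PASSES 3	/* hypothetical bound; the original loop has none */

struct agg_result {
	int passes;		/* how many times the span was summed */
	unsigned long shares;	/* final sum over the span */
};

/* Stand-in redistribution that only hands shares to CPU 3. */
static void redistribute_to_cpu3(unsigned long *shares)
{
	shares[3] = 1024;
}

/* Stand-in redistribution that hands shares to CPU 0. */
static void redistribute_to_cpu0(unsigned long *shares)
{
	shares[0] = 1024;
}

/* Sum shares over the span, retrying at most MAX_PASSES times. */
static struct agg_result aggregate_bounded(unsigned long *shares,
					   const int *span, size_t span_len,
					   unsigned long rq_weight,
					   void (*redistribute)(unsigned long *))
{
	struct agg_result r = { 0, 0 };
	size_t i;

	for (;;) {
		r.shares = 0;
		for (i = 0; i < span_len; i++)
			r.shares += shares[span[i]];
		r.passes++;

		/* Done if the span has shares, has no work, or we give up. */
		if (r.shares || !rq_weight || r.passes >= MAX_PASSES)
			break;

		redistribute(shares);
	}

	return r;
}

/* Run the model on span {0, 1} with all shares initially zero. */
static struct agg_result demo_span01(void (*redistribute)(unsigned long *))
{
	unsigned long shares[NR_CPUS_MODEL] = { 0 };
	const int span[] = { 0, 1 };

	return aggregate_bounded(shares, span, 2, 1, redistribute);
}
```

If redistribution never assigns shares to a CPU inside the span, as with
redistribute_to_cpu3() here, an unbounded "goto again" would never exit;
the bound makes the model give up after MAX_PASSES passes instead.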

BTW, what does the sched_domain tree look like on that 128-cpu machine?

---
 kernel/printk.c |    2 --
 kernel/sched.c  |    1 +
 2 files changed, 1 insertion(+), 2 deletions(-)

Index: linux-2.6-2/kernel/printk.c
===================================================================
--- linux-2.6-2.orig/kernel/printk.c
+++ linux-2.6-2/kernel/printk.c
@@ -1020,8 +1020,6 @@ void release_console_sem(void)
 	console_locked = 0;
 	up(&console_sem);
 	spin_unlock_irqrestore(&logbuf_lock, flags);
-	if (wake_klogd)
-		wake_up_klogd();
 }
 EXPORT_SYMBOL(release_console_sem);
 
Index: linux-2.6-2/kernel/sched.c
========================...
To: Peter Zijlstra <peterz@...>
Cc: David Miller <davem@...>, <mingo@...>, <linux-kernel@...>, <tglx@...>
Date: Tuesday, May 6, 2008 - 6:41 pm

Can you please tell me what the current status of this is?

Rafael
--
To: <rjw@...>
Cc: <peterz@...>, <mingo@...>, <linux-kernel@...>, <tglx@...>
Date: Tuesday, May 6, 2008 - 7:05 pm

From: "Rafael J. Wysocki" <rjw@sisk.pl>

Group scheduling bug, fixed in Linus's tree already.
--
To: David Miller <davem@...>
Cc: <rjw@...>, <peterz@...>, <linux-kernel@...>, <tglx@...>
Date: Wednesday, May 7, 2008 - 2:43 am

fixed by:

 commit 3f5087a2bae5d1ce10a3d698dec8f879a96f5419
 Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
 Date:   Fri Apr 25 00:25:08 2008 +0200

     sched: fix share (re)distribution

     fix __aggregate_redistribute_shares() related lockup reported by
     David S. Miller.
 [...]

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: David Miller <davem@...>, <peterz@...>, <linux-kernel@...>, <tglx@...>
Date: Wednesday, May 7, 2008 - 2:56 pm

--