Re: 2.6.27rc1 cannot boot more than 8CPUs

Previous thread: resume problem with i915/drm by Charles on Tuesday, August 5, 2008 - 10:25 pm. (5 messages)

Next thread: BUG: scheduling while atomic: swapper/1/0x00000002 in 2.6.27-rc1-tip by Jaswinder Singh on Tuesday, August 5, 2008 - 11:15 pm. (2 messages)
To: lkml <linux-kernel@...>
Date: Tuesday, August 5, 2008 - 11:15 pm

I've a Dell R900 with 4 quad-core Xeon processors (16 CPUs in total), but
can only manage to boot it with CONFIG_NR_CPUS=8. Setting
CONFIG_NR_CPUS=16 causes the kernel to hang while booting.

Here's the dmesg with CONFIG_NR_CPUS=8 ...

CPU6: Intel(R) Xeon(R) CPU L7345 @ 1.86GHz stepping 0b
checking TSC synchronization [CPU#0 -> CPU#6]: passed.
CPU 7 irqstacks, hard=c0526000 soft=c051e000
Booting processor 7/26 ip 6000
Initializing CPU#7
Calibrating delay using timer specific routine.. 3723.85 BogoMIPS (lpj=7447700)
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 4096K
CPU: Physical Processor ID: 6
CPU: Processor Core ID: 2
x86 PAT enabled: cpu 7, old 0x7040600070406, new 0x7010600070106
CPU7: Intel(R) Xeon(R) CPU L7345 @ 1.86GHz stepping 0b
checking TSC synchronization [CPU#0 -> CPU#7]: passed.
Brought up 8 CPUs
Total of 8 processors activated (29790.71 BogoMIPS).
net_namespace: 596 bytes
Booting paravirtualized kernel on bare hardware
NET: Registered protocol family 16

Here's the dmesg with CONFIG_NR_CPUS=16 ...
Booting processor 8/1 ip 6000
Initializing CPU#8
Calibrating delay using timer specific routine.. 3723.85 BogoMIPS (lpj=7447793)
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 4096K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
x86 PAT enabled: cpu 7, old 0x7040600070406, new 0x7010600070106
CPU8: Intel(R) Xeon(R) CPU L7345 @ 1.86GHz stepping 0b
checking TSC synchronization [CPU#0 -> CPU#8]: passed.
*** Hangs here ***

How can I debug this further? I'm using the latest linux git pull.

Thanks,
Jeff.
--

To: Jeff Chua <jeff.chua.linux@...>
Cc: lkml <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 2:01 am

One trivial thing to try would be to just bisect it. I assume 2.6.26 is
fine, so while it will take a few boots to try it out (there's 8111
commits in between, so 13 reboots should do it), the advantage of
bisection is that it's fairly straightforward to do even if you don't have
any clue where the problem might lurk.

And with your machine, recompiling the kernel 13 times shouldn't take that
long ;)

Linus
--
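
A minimal sketch of the bisection described above. The reboot estimate is
just ceil(log2(number of commits)); the helper names bisect_steps and
start_bisect are hypothetical, and the tags assume the known-good kernel is
v2.6.26 and the hanging one is v2.6.27-rc1 (substitute HEAD when testing a
fresh git pull).

```shell
#!/bin/sh
# Estimate how many test boots a bisection needs: ceil(log2(commits)).
bisect_steps() {
    commits=$1
    steps=0
    span=1
    # Double the covered span until it reaches the commit count.
    while [ "$span" -lt "$commits" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

# Hypothetical helper: start a bisection between the two release tags.
# Run it inside a kernel git tree. After each test boot, mark the result
# with "git bisect good" or "git bisect bad" until git names the first
# bad commit, then finish with "git bisect reset".
start_bisect() {
    git bisect start
    git bisect bad v2.6.27-rc1
    git bisect good v2.6.26
}
```

Calling "bisect_steps 8111" gives the 13 boots quoted above. Each step
halves the remaining range, which is why no knowledge of where the bug
lives is needed.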

To: Linus Torvalds <torvalds@...>
Cc: lkml <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 2:42 am

On Wed, Aug 6, 2008 at 2:01 PM, Linus Torvalds

Bisecting now.

Jeff.
--

To: Linus Torvalds <torvalds@...>, Yinghai Lu <yhlu.kernel@...>, David Miller <davem@...>, Max Krasnyansky <maxk@...>, Li Zefan <lizf@...>
Cc: lkml <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 11:33 am

Thanks for all the helpful suggestions, everyone. It turns out that I
just needed to enable the following switches, so I didn't bisect
further. Since this is the first machine that I've tried with more
than 8 CPUs, I wasn't sure whether 2.6.16 has the same
Thank you all for the great linux kernel!

Jeff.
--

To: Jeff Chua <jeff.chua.linux@...>
Cc: Linus Torvalds <torvalds@...>, Yinghai Lu <yhlu.kernel@...>, David Miller <davem@...>, Max Krasnyansky <maxk@...>, Li Zefan <lizf@...>, lkml <linux-kernel@...>
Date: Monday, August 11, 2008 - 3:59 pm

i still consider a silent boot hang a bug we need to fix.

bigsmp might be required to have all cpus available on your box, but the
kernel is still supposed to transparently fall back to less CPUs (and
print a warning) if it cannot do that.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Jeff Chua <jeff.chua.linux@...>, Linus Torvalds <torvalds@...>, David Miller <davem@...>, Max Krasnyansky <maxk@...>, Li Zefan <lizf@...>, lkml <linux-kernel@...>
Date: Monday, August 11, 2008 - 4:03 pm

in setup.c::setup_arch(), after going over the MADT or mptable:

#if defined(CONFIG_SMP) && defined(CONFIG_X86_PC) && defined(CONFIG_X86_32)
	if (def_to_bigsmp)
		printk(KERN_WARNING "More than 8 CPUs detected and "
			"CONFIG_X86_PC cannot handle it.\nUse "
			"CONFIG_X86_GENERICARCH or CONFIG_X86_BIGSMP.\n");
#endif

===> the "or" in this message needs to change to "and"

or just panic here? because the message scrolls off the screen and the
user will not notice it...

YH
--

To: Yinghai Lu <yhlu.kernel@...>
Cc: Jeff Chua <jeff.chua.linux@...>, Linus Torvalds <torvalds@...>, David Miller <davem@...>, Max Krasnyansky <maxk@...>, Li Zefan <lizf@...>, lkml <linux-kernel@...>
Date: Monday, August 11, 2008 - 4:08 pm

a panic is better but still quite rude and doesnt give a user a system
under which he can build an even greater kernel [after having discovered
the warning in the syslog] ;-)

best would be to use as many CPUs as we can support, and skip the rest
and boot up fine. (and print the warning prominently - the user does not
make maximum use of available physical resources)

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Jeff Chua <jeff.chua.linux@...>, Linus Torvalds <torvalds@...>, David Miller <davem@...>, Max Krasnyansky <maxk@...>, Li Zefan <lizf@...>, lkml <linux-kernel@...>
Date: Monday, August 11, 2008 - 4:12 pm

then the smp code that starts the AP cpus could, in some cases, check
for apic id >= 8 etc. before trying to start them.

YH
--

To: Ingo Molnar <mingo@...>
Cc: Jeff Chua <jeff.chua.linux@...>, Linus Torvalds <torvalds@...>, David Miller <davem@...>, Max Krasnyansky <maxk@...>, Li Zefan <lizf@...>, lkml <linux-kernel@...>
Date: Monday, August 11, 2008 - 4:36 pm

please check the attached patches..

YH

To: Yinghai Lu <yhlu.kernel@...>
Cc: Jeff Chua <jeff.chua.linux@...>, Linus Torvalds <torvalds@...>, David Miller <davem@...>, Max Krasnyansky <maxk@...>, Li Zefan <lizf@...>, lkml <linux-kernel@...>
Date: Monday, August 11, 2008 - 4:44 pm

applied to tip/x86/urgent - thanks Yinghai. While we are touching this
code i cleaned up the printk a bit: the line breaking was way too ugly,
and the message not very informative about the effects of this problem.
See the full commit below.

Ingo

--------------->
From b74548e76a0eab1f29546e7c5a589429c069a680 Mon Sep 17 00:00:00 2001
From: Yinghai Lu <yhlu.kernel@gmail.com>
Date: Mon, 11 Aug 2008 13:36:04 -0700
Subject: [PATCH] x86: fix 2.6.27rc1 cannot boot more than 8CPUs

Jeff Chua reported that booting a !bigsmp kernel on a 16-way box
hangs silently.

this is a long-standing issue, smp start AP cpu could check the
apic id >=8 etc before trying to start it.

achieve this by moving the def_to_bigsmp check later and skip the
apicid id > 8

[ mingo@elte.hu: clean up the message that is printed. ]

Reported-by: "Jeff Chua" <jeff.chua.linux@gmail.com>
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

arch/x86/kernel/setup.c | 6 ------
arch/x86/kernel/smpboot.c | 10 ++++++++++
2 files changed, 10 insertions(+), 6 deletions(-)
---

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6e5823b..68b48e3 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -861,12 +861,6 @@ void __init setup_arch(char **cmdline_p)
 	init_apic_mappings();
 	ioapic_init_mappings();
 
-#if defined(CONFIG_SMP) && defined(CONFIG_X86_PC) && defined(CONFIG_X86_32)
-	if (def_to_bigsmp)
-		printk(KERN_WARNING "More than 8 CPUs detected and "
-			"CONFIG_X86_PC cannot handle it.\nUse "
-			"CONFIG_X86_GENERICARCH or CONFIG_X86_BIGSMP.\n");
-#endif
 	kvm_guest_init();
 
 	e820_reserve_resources();
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index da10f07..91055d7 100644
--- a/arch/x86/kernel/smp...

To: <jeff.chua.linux@...>
Cc: <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 1:19 am

From: "Jeff Chua" <jeff.chua.linux@gmail.com>

Do you have lockdep enabled? If so, try turning that off.
--

To: David Miller <davem@...>
Cc: <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 2:42 am

It's enabled by default, and I can't seem to disable it: even if I
comment it out or delete it, it comes back after running "make".

CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y

Thanks,
Jeff.
--

To: Jeff Chua <jeff.chua.linux@...>
Cc: David Miller <davem@...>, <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 4:49 am

do you have

CONFIG_X86_GENERICARCH=y
CONFIG_X86_BIGSMP=y

More than 8 CPUs need bigsmp mode.

YH
--

To: Yinghai Lu <yhlu.kernel@...>
Cc: David Miller <davem@...>, <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 5:35 am

Are these the ones that are supposed to be set? Anyway, I can't find a
place to set these using menuconfig.

# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_VSMP is not set

Thanks
Jeff.
--

To: Yinghai Lu <yhlu.kernel@...>
Cc: David Miller <davem@...>, <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 5:42 am

Sorry, found it. These are not obvious. I had selected
"Subarchitecture Type (PC-compatible)" and could not find a place to set
CONFIG_X86_GENERICARCH.

Just found it under " Subarchitecture Type (Generic architecture)",
and then it shows the CONFIG_X86_BIGSMP option.

Ok, compiling and testing now.

Jeff.
--

To: <jeff.chua.linux@...>
Cc: <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 3:18 am

From: "Jeff Chua" <jeff.chua.linux@gmail.com>

You have to turn off CONFIG_PROVE_LOCKING, in fact just turn off
everything in the lock debugging section:

# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
--

To: David Miller <davem@...>
Cc: <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 5:33 am

I don't see any option to turn these off. Still searching.

Jeff.
--

To: <jeff.chua.linux@...>
Cc: <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 5:36 am

From: "Jeff Chua" <jeff.chua.linux@gmail.com>

Maybe edit the ".config" file at the top level of the kernel
sources and then type "make oldconfig" ?!?!?!

--
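
A minimal sketch of the ".config" edit suggested above, assuming a POSIX
shell and sed. The helper name disable_config_opts is hypothetical; it only
rewrites "CONFIG_<name>=..." lines into the "is not set" form, and
"make oldconfig" still has to be run afterwards so Kconfig can re-resolve
dependencies.

```shell
#!/bin/sh
# Rewrite "CONFIG_<name>=..." lines as "# CONFIG_<name> is not set".
# Usage: disable_config_opts <config-file> <name> [<name> ...]
# Names are given without the CONFIG_ prefix.
disable_config_opts() {
    config_file=$1
    shift
    for opt in "$@"; do
        # Write to a temporary file and move it into place, instead of
        # relying on the non-portable "sed -i".
        sed "s/^CONFIG_${opt}=.*/# CONFIG_${opt} is not set/" \
            "$config_file" > "$config_file.tmp" &&
            mv "$config_file.tmp" "$config_file"
    done
}
```

From the top of the kernel tree this would be used as
"disable_config_opts .config PROVE_LOCKING LOCK_STAT DEBUG_LOCK_ALLOC"
followed by "make oldconfig". An option that another enabled option selects
will come back after oldconfig, so the selecting option has to be turned
off as well.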

To: David Miller <davem@...>
Cc: <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 5:50 am

Ok, maybe not as bad as I thought. These are not in .config meaning,

Thanks,
Jeff.
--

To: Jeff Chua <jeff.chua.linux@...>
Cc: lkml <linux-kernel@...>
Date: Tuesday, August 5, 2008 - 11:31 pm

You could try booting CONFIG_NR_CPUS=16 with maxcpus=8 (kernel command line
option).

If it boots you can then try bringing the rest of the cpus online manually
echo 1 > /sys/devices/system/cpu/cpu8/online
...
echo 1 > /sys/devices/system/cpu/cpu15/online

Might get a better OOPS/BUG_ON/etc report.

Max
--
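
The manual onlining above can be sketched as a loop. This assumes a kernel
booted with maxcpus=8 and CPU hotplug available; the function name
online_cpus and the SYSFS_CPU_ROOT override are hypothetical, and the
override exists only so the loop can be dry-run against a scratch
directory.

```shell
#!/bin/sh
# Bring CPUs <first>..<last> online one at a time through sysfs.
# Usage: online_cpus <first> <last>   (needs root on a real system)
online_cpus() {
    first=$1
    last=$2
    # Assumption: SYSFS_CPU_ROOT is an illustrative override, not a
    # kernel or distribution setting.
    root=${SYSFS_CPU_ROOT:-/sys/devices/system/cpu}
    cpu=$first
    while [ "$cpu" -le "$last" ]; do
        ctl="$root/cpu$cpu/online"
        if [ -w "$ctl" ]; then
            # Announce the CPU first, so that if the write hangs the
            # last line printed identifies the CPU being started.
            echo "onlining cpu$cpu"
            echo 1 > "$ctl"
        else
            echo "cpu$cpu: no writable online file, skipping" >&2
        fi
        cpu=$((cpu + 1))
    done
}
```

Running "online_cpus 8 15" from a console where kernel messages are visible
makes it easier to tie any OOPS or BUG_ON output to a specific CPU.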

To: Max Krasnyansky <maxk@...>
Cc: lkml <linux-kernel@...>
Date: Tuesday, August 5, 2008 - 11:50 pm

Ok, booted with CONFIG_NR_CPUS=16 with maxcpus=8, but can't find the
rest of the CPUs.

I can only find cpu0 to cpu7 ...

drwxr-xr-x 4 root root 0 Aug 6 19:38 cpu0
drwxr-xr-x 4 root root 0 Aug 6 19:38 cpu1
drwxr-xr-x 4 root root 0 Aug 6 19:38 cpu2
drwxr-xr-x 4 root root 0 Aug 6 19:38 cpu3
drwxr-xr-x 4 root root 0 Aug 6 19:38 cpu4
drwxr-xr-x 4 root root 0 Aug 6 19:38 cpu5
drwxr-xr-x 4 root root 0 Aug 6 19:38 cpu6
drwxr-xr-x 4 root root 0 Aug 6 19:38 cpu7
-r--r--r-- 1 root root 4096 Aug 6 19:41 online
-r--r--r-- 1 root root 4096 Aug 6 19:39 possible
-r--r--r-- 1 root root 4096 Aug 6 19:38 present
-rw-r--r-- 1 root root 4096 Aug 6 19:38 sched_mc_power_savings

# cat online
0-7
# cat possible
0-23
# cat present
0-7

Thanks,
Jeff.
--

To: Jeff Chua <jeff.chua.linux@...>
Cc: lkml <linux-kernel@...>
Date: Tuesday, August 5, 2008 - 11:54 pm

Are you running 32-bit kernel ?

Max

--

To: Max Krasnyansky <maxk@...>
Cc: lkml <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 12:06 am

Yes. But, does it matter?

Thanks,
Jeff.
--

To: Jeff Chua <jeff.chua.linux@...>
Cc: lkml <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 12:48 am

It used to. The 64-bit kernel used to handle the maxcpus option as documented
in Documentation/cpu-hotplug.txt, and the 32-bit one was broken.
I just looked at the latest code and realized that both are now broken. They
ignore CPUs with id > maxcpus altogether instead of merely not booting them.

I'll send a patch that fixes that tomorrow.

Max
--

To: Max Krasnyansky <maxk@...>
Cc: Jeff Chua <jeff.chua.linux@...>, lkml <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 12:53 am

Yes. I have an x86_64 box with 4 cpus, but yesterday when I booted up with maxcpus=2,

great :)

--

To: Li Zefan <lizf@...>
Cc: Jeff Chua <jeff.chua.linux@...>, lkml <linux-kernel@...>
Date: Wednesday, August 6, 2008 - 4:11 pm

I just sent it and CC'ed both of you guys.
[PATCH] Resurect proper handling of maxcpus= kernel option

Jeff, maybe you can try again booting with maxcpus=8 and then bringing
them online one by one to see where/what fails.

Max

--
