Re: 2.6.27rc1 cannot boot more than 8CPUs

Previous thread: [PATCH] blackfin/sram: use 'unsigned long' for irqflags by Vegard Nossum on Wednesday, August 6, 2008 - 3:58 am. (10 messages)

Next thread: Re: linux-next: Tree for August 5 (MTD build error) by David Brownell on Wednesday, August 6, 2008 - 4:13 am. (5 messages)
From: Jeff Chua
Date: Wednesday, August 6, 2008 - 4:09 am

On Wed, Aug 6, 2008 at 5:42 PM, Jeff Chua <jeff.chua.linux@gmail.com> 

It works. Booted with 16CPUs. 32GB RAM.

CPU0 L7345 1.86GHz  0C
CPU1 L7345 1.86GHz  0C
CPU2 L7345 1.86GHz  0C
CPU3 L7345 1.86GHz  0C
CPU4 L7345 1.86GHz  0C
CPU5 L7345 1.86GHz  0C
CPU6 L7345 1.86GHz  0C
CPU7 L7345 1.86GHz  0C
CPU8 L7345 1.86GHz  0C
CPU9 L7345 1.86GHz  0C
CPU10 L7345 1.86GHz  0C
CPU11 L7345 1.86GHz  0C
CPU12 L7345 1.86GHz  0C
CPU13 L7345 1.86GHz  0C
CPU14 L7345 1.86GHz  0C
CPU15 L7345 1.86GHz  0C


So it works, but the right config setting is not obvious. And should 
CONFIG_X86_PC be considered as well as CONFIG_X86_GENERICARCH?

With CONFIG_X86_PC, I can set CONFIG_SPARSEMEM=y.

With CONFIG_X86_GENERICARCH, CONFIG_SPARSEMEM depends on CONFIG_NUMA.

I'm using the patch below to enable sparsemem instead of flatmem, but 
don't know what impact it has. System booted and running.

It would be nice to automatically default CONFIG_X86_BIGSMP with CPUs > 8. 
But I don't know how to do that.


Thanks,
Jeff.


--- linux/arch/x86/Kconfig.org	2008-08-06 18:41:08 +0800
+++ linux/arch/x86/Kconfig	2008-08-06 18:48:13 +0800
@@ -1035,7 +1035,7 @@

   config ARCH_FLATMEM_ENABLE
   	def_bool y
-	depends on X86_32 && ARCH_SELECT_MEMORY_MODEL && X86_PC && !NUMA
+	depends on X86_32 && ARCH_SELECT_MEMORY_MODEL && !NUMA

   config ARCH_DISCONTIGMEM_ENABLE
   	def_bool y
@@ -1051,7 +1051,7 @@

   config ARCH_SPARSEMEM_ENABLE
   	def_bool y
-	depends on X86_64 || NUMA || (EXPERIMENTAL && X86_PC)
+	depends on X86_64 || NUMA || (EXPERIMENTAL && X86_PC) || X86_GENERICARCH
   	select SPARSEMEM_STATIC if X86_32
   	select SPARSEMEM_VMEMMAP_ENABLE if X86_64

--

From: Yinghai Lu
Date: Wednesday, August 6, 2008 - 9:13 am

actually x86_pc is one mode of genericarch..., genericarch can already
detect pc, bigsmp, numaq, es7000, visws..

hope later we can change mach_default to default. but embedded guys may
want to keep it as a separate one.

in the dmesg when booting x86_pc only, we already have a warning telling
you to set bigsmp if you have more than 8 cpus.

YH
--

From: Jeff Chua
Date: Wednesday, August 6, 2008 - 9:34 am

It seems to get "sparse mem", NUMA must be set first, but this is not

With more than 8 CPUs, the system hangs during boot and Shift+PgUp does
not work, so it's not possible to view console messages except those on
the current page, so I guess I missed that hint.

Jeff.
--

From: Ingo Molnar
Date: Monday, August 11, 2008 - 12:54 pm

thanks, applied.

i'm wondering, with that patch applied, does a working 2.6.26 .config 
put through 'make oldconfig' boot fine on your box now? Any make 
oldconfig breakage is a regression we want to fix. We want upgrades 
between kernel versions to be seamless and complete.

	Ingo
--

From: Ingo Molnar
Date: Wednesday, August 13, 2008 - 7:16 am

btw., could you please check that v2.6.27-rc3 (or later) kernels boot 
fine (with about 8 cpus) even if you have genericarch/bigsmp disabled, 
and do not silently hang as it happened on your box before?

	Ingo
--

From: Jeff Chua
Date: Wednesday, August 13, 2008 - 10:10 am

With 16 CPUs, it still hangs, but now the console is showing the
errors as intended.
... but is it supposed to hang?

More than 8 CPUs detected - skipping them.
Use CONFIG_X86_GENERICARCH and CONFIG_X86_BIGSMP.
More than 8 CPUs detected - skipping them.
Use CONFIG_X86_GENERICARCH and CONFIG_X86_BIGSMP.
More than 8 CPUs detected - skipping them.
Use CONFIG_X86_GENERICARCH and CONFIG_X86_BIGSMP.
More than 8 CPUs detected - skipping them.
Use CONFIG_X86_GENERICARCH and CONFIG_X86_BIGSMP.
Booting processor 8/1 ip 6000
Initializing CPU#8
Calibrating delay using timer specific routine.. 3723.88 BogoMIPS (lpj=7447763)
CPU: L1 I cache: 32Kb, L1 D cache: 32K
CPU: L2 cache: 4096K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
CPU8: Intel(R) Xeon(R) CPU        L7345  @ 1.86GHz stepping 0b
checking TSC synchronization [CPU#0 -> CPU#8]: passed.
*** HANGS HERE ***

Thanks,
Jeff.
--

From: Jeff Chua
Date: Wednesday, August 13, 2008 - 10:33 am

I tried with just CONFIG_NR_CPUS=8 and this time it booted, but the
strange thing is I only see 2 CPUs! To be more precise, that is with both
CONFIG_X86_GENERICARCH and CONFIG_X86_BIGSMP disabled.

And when I tried to enable the CPUs, it complained about:

# cat cpu6/online
0
# echo 1 > cpu6/online
More than 8 CPUs detected - skipping them.
Use CONFIG_X86_GENERICARCH and CONFIG_X86_BIGSMP.
-bash: echo: write error: Input/output error

Prior to the patch, the system booted with all 8 CPUs.

Again, if I enable both CONFIG_X86_GENERICARCH and CONFIG_X86_BIGSMP,
I get all 16 CPUs.

Thanks,
Jeff.
--

From: Ingo Molnar
Date: Wednesday, August 13, 2008 - 10:39 am

Yinghai, could the APIC ID enumeration be nonsequential and we skip CPUs 
starting at the third one already? I think we should accept all CPUs 
that are within our support range.

	Ingo
--
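
Ingo's question above can be illustrated with a toy model. This is a
sketch only, not the kernel's code: the two function names, the
FLAT_APIC_LIMIT constant and the "stop at first miss" behaviour are
assumptions made up for illustration. It contrasts giving up at the
first out-of-range APIC ID with accepting every CPU whose APIC ID is
within range, whatever order the firmware lists them in.

```c
#include <stddef.h>

/* Limit named in the boot message quoted in this thread. */
#define FLAT_APIC_LIMIT 8

/*
 * Hypothetical "stop at first miss" policy: give up as soon as one
 * APIC ID falls outside the range, dropping every CPU listed later.
 */
static size_t accept_until_first_miss(const unsigned *apicid, size_t n)
{
    size_t accepted = 0;

    for (size_t i = 0; i < n; i++) {
        if (apicid[i] >= FLAT_APIC_LIMIT)
            break;
        accepted++;
    }
    return accepted;
}

/*
 * Policy suggested by the question: skip only the out-of-range
 * entries and keep every CPU whose APIC ID is within range.
 */
static size_t accept_all_in_range(const unsigned *apicid, size_t n)
{
    size_t accepted = 0;

    for (size_t i = 0; i < n; i++) {
        if (apicid[i] < FLAT_APIC_LIMIT)
            accepted++;
    }
    return accepted;
}
```

For a made-up enumeration order such as 0, 1, 8, 9, 2, 3, 10, 11, 4, 5,
12, 13, 6, 7, 14, 15, the first policy keeps 2 CPUs and the second
keeps 8. The real enumeration order on the reported box is not given in
the thread.
--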

From: Yinghai Lu
Date: Wednesday, August 13, 2008 - 10:46 am

will try to clear those bits on smp_sanity_check...

YH
--

From: Yinghai Lu
Date: Wednesday, August 13, 2008 - 11:33 am

jeff,

please check the attached patch. it should fix the new regression and
will not hang.

YH
--

From: Jeff Chua
Date: Thursday, August 14, 2008 - 12:16 am

OK, it booted and did not hang, but the messages below don't show up
anywhere. I've tested with CONFIG_NR_CPUS=16 and 8 as well. I just got 8
CPUs.

More than 8 CPUs detected - skipping them.
Use CONFIG_X86_GENERICARCH and CONFIG_X86_BIGSMP.

# cat /sys/devices/system/cpu/possible
0-7

CONFIG_X86_32=y
CONFIG_X86_PC=y


Looks like it's not going into this condition
+        if (def_to_bigsmp && nr_cpu_ids > 8) {


Shall this be put back so that it'll show the message?
-       if (def_to_bigsmp && apicid > 8) {
-               printk(KERN_WARNING
-                       "More than 8 CPUs detected - skipping them.\n"
-                       "Use CONFIG_X86_GENERICARCH and CONFIG_X86_BIGSMP.\n");
-       }


Thanks,
Jeff.
--
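
One possible reading of Jeff's observation, sketched below under stated
assumptions: if nr_cpu_ids is the highest possible CPU number plus one,
then a possible map of 0-7 gives nr_cpu_ids == 8 and the quoted
"> 8" test can never be true. The function names and the
detected_cpus parameter are hypothetical, and whether nr_cpu_ids is
already capped at the point of the check is not stated in the thread.

```c
#include <stdbool.h>

/* Limit named in the boot message quoted in this thread. */
#define FLAT_APIC_LIMIT 8

/*
 * Condition as quoted from the tested patch. nr_cpu_ids is assumed to
 * be the highest possible CPU number plus one.
 */
static bool warn_on_nr_cpu_ids(bool def_to_bigsmp, unsigned nr_cpu_ids)
{
    return def_to_bigsmp && nr_cpu_ids > FLAT_APIC_LIMIT;
}

/*
 * Hypothetical alternative: compare against the number of CPUs the
 * firmware reported, counted before any of them are dropped.
 */
static bool warn_on_detected(bool def_to_bigsmp, unsigned detected_cpus)
{
    return def_to_bigsmp && detected_cpus > FLAT_APIC_LIMIT;
}
```

With a possible map of 0-7, warn_on_nr_cpu_ids(true, 8) is false, while
warn_on_detected(true, 16) is true on a 16-CPU box.
--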

From: Yinghai Lu
Date: Thursday, August 14, 2008 - 1:59 am

double checked on a 16-core system and got:

CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 0(4) -> Core 0
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
using C1E aware idle routine
Checking 'hlt' instruction... OK.
ACPI: Core revision 20080609
Parsing all Control Methods:
Table [DSDT](id 0001) - 1289 Objects with 114 Devices 462 Methods 26 Regions
Parsing all Control Methods:
Table [SSDT](id 0002) - 80 Objects with 0 Devices 0 Methods 0 Regions
 tbxface-0596 [00] tb_load_namespace     : ACPI Tables successfully acquired
evxfevnt-0091 [00] enable                : Transition to ACPI mode successful
More than 8 CPUs detected - skipping them.
Use CONFIG_X86_GENERICARCH and CONFIG_X86_BIGSMP.
enabled ExtINT on CPU#0

YH
--

From: Ingo Molnar
Date: Thursday, August 14, 2008 - 2:07 am

could you post the full dmesg? And the modified patch that you've tested, 
which both brings up 8 CPUs without bigsmp and shows the printk?

	Ingo
--
