> On Sun, Nov 14, 2010 at 03:12:20AM +0800, Yinghai Lu wrote:
>> On 11/13/2010 05:10 AM, Wu Fengguang wrote:
>>> On Sat, Nov 13, 2010 at 08:57:58PM +0800, Peter Zijlstra wrote:
>>>> On Sat, 2010-11-13 at 20:00 +0800, Wu Fengguang wrote:
>>>>> On Sat, Nov 13, 2010 at 06:30:24PM +0800, Peter Zijlstra wrote:
>>>>>> On Sat, 2010-11-13 at 16:40 +0800, Wu Fengguang wrote:
>>>>>>>> Will try and figure out how the heck that's happening, Ingo any clue?
>>>>>>>
>>>>>>> It's back to normal on 2.6.37-rc1 when reverting commit 50f2d7f682f9
>>>>>>> ("x86, numa: Assign CPUs to nodes in round-robin manner on fake NUMA").
>>>>>>>
>>>>>>> The interesting part is, the commit was introduced in
>>>>>>> 2.6.36-rc7..2.6.36, however 2.6.36 boots OK, while 2.6.37-rc1 panics.
>>>>>>
>>>>>> Argh, that commit again..
>>>>>>
>>>>>> Does this fix it:
>>>>>> http://lkml.org/lkml/2010/11/12/8
>>>>>
>>>>> No it still panics. Here is the dmesg.
>>>>
>>>> OK, I'll let Nikanth have a look; if all else fails we can always
>>>> revert that patch.
>>>
>>> It's the same bug.
>>>
>>> Just tried another machine, I get the same divide error. The patch
>>> posted in lkml/2010/11/12/8 does not fix it. But after reverting
>>> commit 50f2d7f682f9, it boots OK.
>>>
>>> Thanks,
>>> Fengguang
>>> ---
>>> PS. dmesg with divide error
>>>
>>> [ 0.000000] console [ttyS0] enabled, bootconsole disabled
>>> [ 0.000000] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
>>> [ 0.000000] ... MAX_LOCKDEP_SUBCLASSES: 8
>>> [ 0.000000] ... MAX_LOCK_DEPTH: 48
>>> [ 0.000000] ... MAX_LOCKDEP_KEYS: 8191
>>> [ 0.000000] ... CLASSHASH_SIZE: 4096
>>> [ 0.000000] ... MAX_LOCKDEP_ENTRIES: 16384
>>> [ 0.000000] ... MAX_LOCKDEP_CHAINS: 32768
>>> [ 0.000000] ... CHAINHASH_SIZE: 16384
>>> [ 0.000000] memory used by lock dependency info: 6367 kB
>>> [ 0.000000] per task-struct memory footprint: 2688 bytes
>>> [ 0.000000] allocated 167772160 bytes of page_cgroup
>>> [ 0.000000] please try 'cgroup_disable=memory' option if you don't want memory cgroups
>>> [ 0.000000] ODEBUG: 15 of 15 active objects replaced
>>> [ 0.000000] hpet clockevent registered
>>> [ 0.001000] Fast TSC calibration using PIT
>>> [ 0.002000] Detected 2800.469 MHz processor.
>>> [ 0.000010] Calibrating delay loop (skipped), value calculated using timer frequency.. 5600.93 BogoMIPS (lpj=2800469)
>>> [ 0.010818] pid_max: default: 32768 minimum: 301
>>> [ 0.021745] Dentry cache hash table entries: 2097152 (order: 12, 16777216 bytes)
>>> [ 0.035657] Inode-cache hash table entries: 1048576 (order: 11, 8388608 bytes)
>>> [ 0.044553] Mount-cache hash table entries: 256
>>> [ 0.049469] Initializing cgroup subsys debug
>>> [ 0.053834] Initializing cgroup subsys ns
>>> [ 0.057940] ns_cgroup deprecated: consider using the 'clone_children' flag without the ns_cgroup.
>>> [ 0.066968] Initializing cgroup subsys cpuacct
>>> [ 0.071511] Initializing cgroup subsys memory
>>> [ 0.075988] Initializing cgroup subsys devices
>>> [ 0.080527] Initializing cgroup subsys freezer
>>> [ 0.085107] CPU: Physical Processor ID: 0
>>> [ 0.089209] CPU: Processor Core ID: 0
>>> [ 0.092974] mce: CPU supports 9 MCE banks
>>> [ 0.097095] CPU0: Thermal monitoring enabled (TM1)
>>> [ 0.101990] using mwait in idle threads.
>>> [ 0.106006] Performance Events: PEBS fmt1+, Westmere events, Intel PMU driver.
>>> [ 0.113535] ... version: 3
>>> [ 0.117641] ... bit width: 48
>>> [ 0.121828] ... generic registers: 4
>>> [ 0.125926] ... value mask: 0000ffffffffffff
>>> [ 0.131328] ... max period: 000000007fffffff
>>> [ 0.136734] ... fixed-purpose events: 3
>>> [ 0.140839] ... event mask: 000000070000000f
>>> [ 0.147297] ACPI: Core revision 20101013
>>> [ 0.175646] ftrace: allocating 24175 entries in 95 pages
>>> [ 0.190912] Setting APIC routing to flat
>>> [ 0.195562] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
>>> [ 0.211643] CPU0: Intel(R) Xeon(R) CPU X5660 @ 2.80GHz stepping 01
>>> [ 0.325243] lockdep: fixing up alternatives.
>>> [ 0.330242] Booting Node 0, Processors #1lockdep: fixing up alternatives.
>>> [ 0.430140] #2lockdep: fixing up alternatives.
>>> [ 0.526962] #3lockdep: fixing up alternatives.
>>> [ 0.623755] #4lockdep: fixing up alternatives.
>>> [ 0.720588] Ok.
>>> [ 0.722525] Booting Node 1, Processors #5lockdep: fixing up alternatives.
>>> [ 0.822389] Ok.
>>> [ 0.824327] Booting Node 0, Processors #6
>>> [ 0.919089] TSC synchronization [CPU#0 -> CPU#6]:
>>> [ 0.924155] Measured 296 cycles TSC warp between CPUs, turning off TSC clock.
>>> [ 0.003999] Marking TSC unstable due to check_tsc_sync_source failed
>>> [ 0.557048] lockdep: fixing up alternatives.
>>> [ 0.558041] Ok.
>>> [ 0.559004] Booting Node 1, Processors #7 Ok.
>>> [ 0.632157] Brought up 8 CPUs
>>> [ 0.633006] Total of 8 processors activated (44799.46 BogoMIPS).
>>
>> I assume that with
>> CONFIG_NR_CPUS=16
>> instead of
>> CONFIG_NR_CPUS=8
>>
>> it will boot OK?
>
> No. But it boots OK with CONFIG_NR_CPUS=64: the machine actually has 24 CPUs,
> a bit more than you expected :)
>
> This also boots the other 16-CPU box that used to lock up in find_busiest_group().