I'm testing host cpu hotplug with kvm. Basically running 7 guests on a 4 core machine, offlining and onlining host cpus at random. Eventually I hit this: [4298303.496645] Booting processor 3/7 ip 6000 [4298303.506116] Initializing CPU#3 [4298303.506116] Calibrating delay using timer specific routine.. 5319.66 BogoMIPS (lpj=2659833) [4298303.506116] CPU: L1 I cache: 32K, L1 D cache: 32K [4298303.506116] CPU: L2 cache: 4096K [4298303.506116] CPU: Physical Processor ID: 3 [4298303.506116] CPU: Processor Core ID: 1 [4298303.506116] x86 PAT enabled: cpu 3, old 0x7040600070406, new 0x7010600070106 [4298303.582937] CPU3: Intel(R) Xeon(R) CPU 5150 @ 2.66GHz stepping 06 [4298303.585087] checking TSC synchronization [CPU#0 -> CPU#3]: passed. [4298303.707287] Switched to high resolution mode on CPU 3 [4298303.712943] kvm: enabling virtualization on CPU3 [4298303.713955] CPU0 attaching sched-domain: [4298303.713901] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158 [4298303.713901] IP: [<ffffffff8022e722>] pick_next_task_fair+0x55/0x7c [4298303.713901] PGD 0 [4298303.713901] Oops: 0000 [1] PREEMPT SMP [4298303.713901] CPU 3 [4298303.713901] Modules linked in: kvm_intel kvm netconsole autofs4 nfs lockd nfs_acl sunrpc bridge llc acpi_cpufreq backlight sg e1000 button serio_raw rtc_cmos rtc_core rtc_lib ata_piix dm_snapshot dm_mod ahci libata dock sd_mod scsi_mod [last unloaded: kvm] [4298303.713901] Pid: 15115, comm: migration/3 Not tainted 2.6.26-rc2 #723 [4298303.713901] RIP: 0010:[<ffffffff8022e722>] [<ffffffff8022e722>] pick_next_task_fair+0x55/0x7c [4298303.713901] RSP: 0018:ffff81004fdfbe20 EFLAGS: 00010046 [4298303.713901] RAX: 0000000000000000 RBX: ffff81000103df80 RCX: 0000000000000000 [4298303.713901] RDX: ffff81000103e038 RSI: 000000003b9aca00 RDI: ffff81000103df00 [4298303.713901] RBP: ffff81004fdfbe40 R08: ffff81004fdfbdd0 R09: ffff81000103a0a0 [4298303.713901] R10: 0000000000000000 R11: 0000000000000003 R12: ...
I get the same oops running with the kvm modules unloaded. None of kvm.git's changes are in core code, so this is definitely a mainline problem. It took about 40 minutes to reproduce; 1200 cpus onlined and 1200 cpus offlined. -- error compiling committee.c: too many arguments to function --
FWIW, this happens on s390 as well (without kvm support). But until now I didn't have time to look into it. Didn't report it here yet, since we had plenty of cpu hotplug changes in arch code, so I suspected a bug there. This doesn't seem to be the case. It takes only a few minutes to trigger the bug here. --
[ cc'ed a few folks. ]
So the cfs-tree likely gets out-of-sync. I pressume, it won't be
reproducible with CONFIG_SCHED_GROUP options being disabled.
Anyway, would you try one of these debug-patches (not sure about the
workability of the second one though :-/)
Let's check what are the values for 'cfs_rq->weight.load/nr_running'.
thanks in advance,
(non-whitespace-damaged versions are enclosed)
---
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1291,6 +1291,12 @@ static struct task_struct
*pick_next_task_fair(struct rq *rq)
do {
se = pick_next_entity(cfs_rq);
+
+ if (unlikely(!se))
+ printk(KERN_ERR "BUG: se == NULL but
nr_running (%ld), load (%ld),"
+ " rq-nr_running (%ld), rq-load (%ld)\n",
+ cfs_rq->nr_running,
cfs_rq->load.weight, rq->nr_running, rq->load.weight);
+
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
---
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1280,6 +1280,8 @@ static void check_preempt_wakeup(struct rq *rq,
struct task_str uct *p)
resched_task(curr);
}
+static void sysrq_sched_debug_show(void);
+
static struct task_struct *pick_next_task_fair(struct rq *rq)
{
struct task_struct *p;
@@ -1291,6 +1293,10 @@ static struct task_struct
*pick_next_task_fair(struct rq *rq)
do {
se = pick_next_entity(cfs_rq);
+
+ if (unlikely(!se))
+ sysrq_sched_debug_show();
+
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
---
--
Best regards,
Dmitry Adamushko
Got this for the first patch: [4302727.615522] Booting processor 3/7 ip 6000 [4302727.625923] Initializing CPU#3 [4302727.625923] Calibrating delay using timer specific routine.. 5319.76 BogoMIPS (lpj=2659883) [4302727.625923] CPU: L1 I cache: 32K, L1 D cache: 32K [4302727.625923] CPU: L2 cache: 4096K [4302727.625923] CPU: Physical Processor ID: 3 [4302727.625923] CPU: Processor Core ID: 1 [4302727.625923] x86 PAT enabled: cpu 3, old 0x7040600070406, new 0x7010600070106 [4302727.692484] CPU3: Intel(R) Xeon(R) CPU 5150 @ 2.66GHz stepping 06 [4302727.694236] checking TSC synchronization [CPU#1 -> CPU#3]: passed. [4302727.824185] Switched to high resolution mode on CPU 3 [4302727.859184] kvm: enabling virtualization on CPU3 [4302727.859714] Sched Debug Version: v0.07, 2.6.26-rc2 #726 [4302727.859714] now at 6918576.148656 msecs [4302727.859714] .sysctl_sched_latency : 60.000000 [4302727.859714] .sysctl_sched_min_granularity : 12.000000 [4302727.859714] .sysctl_sched_wakeup_granularity : 30.000000 [4302727.859714] .sysctl_sched_child_runs_first : 0.000001 [4302727.860191] .sysctl_sched_features : 895 [4302727.860191] [4302727.860191] cpu#0, 2659.999 MHz [4302727.860191] .nr_running : 2 [4302727.860191] .load : 841 [4302727.860191] .nr_switches : 3427530 [4302727.861205] .nr_load_updates : 2183358 [4302727.861205] .nr_uninterruptible : 15 [4302727.861205] .jiffies : 4301585875 [4302727.861205] .next_balance : 4301.585696 [4302727.861205] .curr->pid : 4678 [4302727.861205] .clock : 6918579.002757 [4302727.862216] .cpu_load[0] : 841 [4302727.862216] .cpu_load[1] : 841 [4302727.862216] .cpu_load[2] : 841 [4302727.862216] .cpu_load[3] ...
Err... sorry for the broken patch. The patch below on top of the
previous one should address this issue (ugly, but should be ok for
debugging).
'tasklist_lock' shouldn't cause a double lock, I guess.
Sorry for rather 'blind' attempts. If no, then I'll
prepare/test/take-a-closer-look at it later today when I'm at home.
TIA,
------ kernel/sched_debug-prev.c 2008-05-14 14:53:28.000000000 +0200
+++ kernel/sched_debug.c 2008-05-14 14:58:12.000000000 +0200
@@ -125,6 +125,7 @@ void print_cfs_rq(struct seq_file *m, in
char path[128] = "";
struct cgroup *cgroup = NULL;
struct task_group *tg = cfs_rq->tg;
+ int was_locked;
if (tg)
cgroup = tg->css.cgroup;
@@ -138,7 +139,11 @@ void print_cfs_rq(struct seq_file *m, in
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "exec_clock",
SPLIT_NS(cfs_rq->exec_clock));
- spin_lock_irqsave(&rq->lock, flags);
+ was_locked = spin_is_locked(&rq->lock);
+
+ if (!was_locked)
+ spin_lock_irqsave(&rq->lock, flags);
+
if (cfs_rq->rb_leftmost)
MIN_vruntime = (__pick_next_entity(cfs_rq))->vruntime;
last = __pick_last_entity(cfs_rq);
@@ -146,7 +151,10 @@ void print_cfs_rq(struct seq_file *m, in
max_vruntime = last->vruntime;
min_vruntime = rq->cfs.min_vruntime;
rq0_min_vruntime = per_cpu(runqueues, 0).cfs.min_vruntime;
- spin_unlock_irqrestore(&rq->lock, flags);
+
+ if (!was_locked)
+ spin_unlock_irqrestore(&rq->lock, flags);
+
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "MIN_vruntime",
SPLIT_NS(MIN_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "min_vruntime",
---
--
Best regards,
Dmitry Adamushko
Now it died when allocating the pda: [4301986.526627] Booting processor 1/6 ip 6000 [4301986.537195] swapper: page allocation failure. order:2, mode:0x20 [4301986.537195] Pid: 0, comm: swapper Not tainted 2.6.26-rc2 #728 [4301986.537195] [4301986.537195] Call Trace: [4301986.537195] [<ffffffff802388ae>] ? __printk_ratelimit+0x9/0xb [4301986.537195] [<ffffffff80272e07>] __alloc_pages_internal+0x387/0x39f [4301986.537195] [<ffffffff80272e35>] __alloc_pages+0xb/0xd [4301986.537195] [<ffffffff80272e4c>] __get_free_pages+0x15/0x46 [4301986.537195] [<ffffffff80210cdf>] pda_init+0x87/0xac [4301986.537195] [<ffffffff8042fed1>] cpu_init+0x5e/0x34a [4301986.537195] [<ffffffff8026b437>] ? rcu_needs_cpu+0x36/0x47 [4301986.537195] [<ffffffff802311a7>] ? idle_task_exit+0x71/0xad [4301986.537195] [<ffffffff804331aa>] start_secondary+0xc/0x178 [4301986.537195] [4301986.537195] Mem-info: [4301986.537195] DMA per-cpu: [4301986.537195] CPU 0: hi: 0, btch: 1 usd: 0 [4301986.537195] CPU 2: hi: 0, btch: 1 usd: 0 [4301986.537195] DMA32 per-cpu: [4301986.537195] CPU 0: hi: 186, btch: 31 usd: 153 [4301986.537195] CPU 2: hi: 186, btch: 31 usd: 135 [4301986.537195] Active:347879 inactive:109278 dirty:107 writeback:0 unstable:0 [4301986.537195] free:5311 slab:10186 mapped:30496 pagetables:1587 bounce:0 [4301986.537195] DMA free:8012kB min:24kB low:28kB high:36kB active:60kB inactive:0kB present:8644kB pages_scanned:0 all_unreclaimable? no [4301986.537195] lowmem_reserve[]: 0 1999 1999 1999 [4301986.537195] DMA32 free:13232kB min:5704kB low:7128kB high:8556kB active:1391456kB inactive:437112kB present:2047248kB pages_scanned:0 all_unreclaimable? no [4301986.537195] lowmem_reserve[]: 0 0 0 0 [4301986.537195] DMA: 19*4kB 16*8kB 10*16kB 11*32kB 6*64kB 4*128kB 1*256kB 2*512kB 3*1024kB 1*2048kB 0*4096kB = 8012kB [4301986.537195] DMA32: 2034*4kB 570*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 13224kB [4301986.537195] 212750 ...
Hi Avi, I think you fixed this specific bug. Does cpu hotplug stresstest now work for you? --
Which bug, the order-2 allocs? I certainly didn't fix it. Last I tried, cpu hotplug still failed, but it was some time ago as patches flow. -- error compiling committee.c: too many arguments to function --
Ah no. I thought "x86: fix crash on cpu hotplug on pat-incapable machines"
would have fixed the cpu hotplug bug for you.
I was actually talking of the pick_next_task_fair incarnation of one of
the cpu hotplug bugs.
s390 backtrace on 2.6.25 looks like this.
<1>Unable to handle kernel pointer dereference at virtual kernel address 00000200000db000
<4>Oops: 003b [#1] PREEMPT SMP
<4>Modules linked in: dm_multipath sunrpc qeth_l2 dm_mod cu3088 qeth ccwgroup
<4>CPU: 0 Not tainted 2.6.25-27.x.20080422-s390xdefault #1
<4>Process migration/0 (pid: 5806, task: 00000000aac3a838, ksp: 000000017a2f3dd0)
<4>Krnl PSW : 0400200180000000 0000000000125f48 (pick_next_task_fair+0x34/0xd4)
<4> R:0 T:1 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:0 CC:2 PM:0 EA:3
<4>Krnl GPRS: 0000079e029c6ecb 0000000000000000 000000008548ea60 000000008548d700
<4> 00000000004526f8 0000000000000597 0000000000000000 0000000000000000
<4> 0000000000672a00 0000000000000000 00000000aac3a838 000000017a2f3d10
<4> 00000200000db85a 00000000004672a0 000000017a2f3d50 000000017a2f3d10
<4>Krnl Code: 0000000000125f38: e3e0f0980024 stg %r14,152(%r15)
<4> 0000000000125f3e: d507d000c010 clc 0(8,%r13),16(%r12)
<4> 0000000000125f44: a784004e brc 8,125fe0
<4> >0000000000125f48: d507d000c030 clc 0(8,%r13),48(%r12)
<4> 0000000000125f4e: b904002c lgr %r2,%r12
<4> 0000000000125f52: a7990000 lghi %r9,0
<4> 0000000000125f56: a7840033 brc 8,125fbc
<4> 0000000000125f5a: c0e5ffffee23 brasl %r14,123ba0
<4>Call Trace:
<4>([<00000000aac3a838>] 0xaac3a838)
<4> [<000000000045270c>] schedule+0x178/0x978
<4> [<000000000012fe86>] migration_thread+0x1aa/0x30c
<4> [<0000000000152ad8>] kthread+0x68/0xa0
<4> [<000000000010a0be>] kernel_thread_starter+0x6/0xc
...No, that only fixes an immediate oops when running in a virtual machine Seems to be the same bug, yes. -- error compiling committee.c: too many arguments to function --
Gentlemen, I built an ia64 kernel from Andrew's tree (2.6.26-rc2-mm1) and get a very predictable hotplug cpu problem. billberry1:/tmp/cpw # ./dis disabled cpu 17 enabled cpu 17 billberry1:/tmp/cpw # ./dis disabled cpu 17 enabled cpu 17 billberry1:/tmp/cpw # ./dis The script that disables the cpu always hangs (unkillable) on the 3rd attempt. I haven't spent any debugging time on it yet. Just wondering if you've seen it? It doesn't seem to happen x86_64. -Cliff Wickman cpw@sgi.com --
