[BUG] perf: hard lockup when using perf-sched

Previous thread: [RFC] Input: implement sysrq as an input handler by Dmitry Torokhov on Tuesday, March 16, 2010 - 11:19 pm. (13 messages)

Next thread: [PATCH v2] KVM: cleanup {kvm_vm_ioctl, kvm}_get_dirty_log() by Xiao Guangrong on Wednesday, March 17, 2010 - 12:49 am. (6 messages)
From: Li Zefan
Date: Wednesday, March 17, 2010 - 12:21 am

An idea came to me that running perf-sched with a fork "bomb" might
reveal some bug.

So I wrote a program (attached), and did the following test:

# gcc fork.c
# ./a.out &
# perf sched record

After a few seconds, the machine froze, without printing any message.

I've tried the test both on v2.6.34-rc1 and v2.6.34-rc1-tip+.

I'll send you any other information you need.
From: Li Zefan
Date: Tuesday, March 23, 2010 - 10:52 pm

Anyone have a look at this bug ?


--

From: Mike Galbraith
Date: Wednesday, March 24, 2010 - 12:32 am

I just saw this, hunted down your testcase and tried it here.  Looks
like perf_output_lock() wedged box.

[  212.439223] BUG: NMI Watchdog detected LOCKUP on CPU2, ip ffffffff81078ca2, registers:
[  212.439223] CPU 2 
[  212.439223] Modules linked in: cpufreq_conservative cpufreq_ondemand cpufreq_userspace cpufreq_powersave acpi_cpufreq snd_pcm_oss snd_mixer_oss microcode snd_seq snd_seq_device fuse loop dm_mod snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer firewire_ohci firewire_core crc_itu_t usb_storage snd ohci1394 soundcore snd_page_alloc usb_libusual rtc_cmos ieee1394 thermal button processor rtc_core sr_mod rtc_lib e1000e i2c_i801 cdrom sg usbhid hid uhci_hcd ehci_hcd sd_mod usbcore edd fan ext3 ext2 mbcache jbd ahci libata scsi_mod
[  212.439223] 
[  212.439223] Pid: 0, comm: swapper Not tainted 2.6.34-smpx #1475 MS-7502/MS-7502
[  212.439223] RIP: 0010:[<ffffffff81078ca2>]  [<ffffffff81078ca2>] perf_output_begin+0x81/0x191
[  212.439223] RSP: 0018:ffff880001703af0  EFLAGS: 00000086
[  212.439223] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000002
[  212.439223] RDX: 0000000000000058 RSI: ffff88003715c000 RDI: ffff880001703b40
[  212.439223] RBP: 0000000000000000 R08: 0000000000000001 R09: ffff8800bb81fc00
[  212.439223] R10: 0000000000000200 R11: 0000000000000000 R12: 0000000000000001
[  212.439223] R13: ffff88000170f9e0 R14: ffff880001703c90 R15: ffff880001703b70
[  212.439223] FS:  0000000000000000(0000) GS:ffff880001700000(0000) knlGS:0000000000000000
[  212.439223] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  212.439223] CR2: 00007f1b2e3aa4a8 CR3: 0000000001420000 CR4: 00000000000006e0
[  212.439223] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  212.439223] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  212.439223] Process swapper (pid: 0, threadinfo ffff8800bf892000, task ffff8800bf843450)
[  212.439223] Stack:
[  212.439223]  ffff880001703b70 ffffffff810797a3 ...
From: Mike Galbraith
Date: Wednesday, March 24, 2010 - 1:17 am

(turns on frame pointers, and adds noinline)

[   90.276644] BUG: NMI Watchdog detected LOCKUP on CPU2, ip ffffffff812b54a8, registers:
[   90.276644] CPU 2 
[   90.276644] Modules linked in: cpufreq_conservative cpufreq_ondemand cpufreq_userspace cpufreq_powersave acpi_cpufreq snd_pcm_oss snd_mixer_oss microcode snd_seq snd_seq_device fuse loop dm_mod snd_hda_codec_realtek snd_hda_intel firewire_ohci firewire_core snd_hda_codec usbmouse crc_itu_t snd_hwdep snd_pcm usb_storage snd_timer sr_mod usbhid ohci1394 snd rtc_cmos soundcore rtc_core ieee1394 usb_libusual thermal i2c_i801 processor button e1000e snd_page_alloc cdrom rtc_lib hid sg uhci_hcd ehci_hcd sd_mod usbcore edd fan ext3 ext2 mbcache jbd ahci libata scsi_mod
[   90.276644] 
[   90.276644] Pid: 20237, comm: forkbomb Not tainted 2.6.34-smpx #1477 MS-7502/MS-7502
[   90.276644] RIP: 0010:[<ffffffff812b54a8>]  [<ffffffff812b54a8>] _raw_spin_lock+0x12/0x1a
[   90.276644] RSP: 0018:ffff880001703ee8  EFLAGS: 00000097
[   90.276644] RAX: 000000000000c3c2 RBX: ffff880001712e40 RCX: 0000001504ab3a62
[   90.276644] RDX: 0000001504ab3b83 RSI: 0000000000000026 RDI: ffff880001712e40
[   90.276644] RBP: ffff880001703ee8 R08: 000000001ba05e30 R09: ffff880001703f88
[   90.276644] R10: 0000000000000038 R11: 0000000000000038 R12: 0000000000000002
[   90.276644] R13: ffff8800b48de2d0 R14: 0000000000000038 R15: 00000000ffffffe4
[   90.276644] FS:  0000000000000000(0000) GS:ffff880001700000(0000) knlGS:0000000000000000
[   90.276644] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   90.276644] CR2: 00007ff05702a4a8 CR3: 0000000001430000 CR4: 00000000000006e0
[   90.276644] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   90.276644] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   90.276644] Process forkbomb (pid: 20237, threadinfo ffff8800bd3f8000, task ffff8800b48de2d0)
[   90.276644] Stack:
[   90.276644]  ffff880001703f18 ffffffff8102ecc4 ffff8800b48de2d0 ffff8800b48de2d0
[   90.276644] <0> ...
From: Li Zefan
Date: Thursday, March 25, 2010 - 1:04 am

Thanks! Then who's going to fix this...

--

From: Mike Galbraith
Date: Thursday, March 25, 2010 - 2:27 am

Well, that kinda depends on whether I figure out how the heck it's all
supposed to work before somebody else whacks it or not.

ATM, I've instrumented, know _what's_ happening, but find myself saying
"wtf?" a lot, especially wrt handle->locked.  The act of attempting to
lock a handle declares it unlocked, turning perf_output_unlock() into a
noop, which looks a bit strange.  We're spinning on those "unlocked"
locks, all left genuinely locked by one CPU.  I just whacked the thing,
and am very likely about to see in yet another trace.

Locking is hard, "curious construct" locking is even harder :)

	-Mike

--

From: Peter Zijlstra
Date: Thursday, March 25, 2010 - 2:54 am

Yeah, that perf_output_{un,}lock() stuff is a pain.. 

handle->locked should indicate if we're the outer most context

the problem this stuff is solving is that of publishing the data head
pointer, suppose two concurrent writers:

 head = 0;

 A: reserve N bytes
    write content
    commit

 B: reserve M bytes
    write content
    commit

Now, if B happens during A, and B completes before A is done, B's commit
would expose a head pointer of N+M bytes, even though the first N aren't
completely written yet.

So the trick is to only update the head pointer once everybody is done
writing. You can model this as a recursion problem and only update once
the recursion count hits 0 again.

However, with multiple cpus and NMIs and a busy enough environment this
never needs to happen, which would be bad, since we then cannot make
progress.

Hence the curious construct which basically serializes per cpu. Since
each cpu can only have a limited nesting context, and any one cpu is
bound to make progress (unless of course there is another bug that
prevents that) we'll post regular updates to the head pointer.



--

From: Mike Galbraith
Date: Friday, March 26, 2010 - 3:11 am

This seems to work, in contrast to everything I tried yesterday.  Not
exactly a thing of beauty, but at least it's an option, so...

perf: fix perf sched record forkbomb deadlock

perf sched record can deadlock a box should the holder of handle->data->lock
take an interrupt, and then attempt to acquire an rq lock held by a CPU trying
to acquire the same lock.  Disable interrupts.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <new-submission>

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 574ee58..2ba2e9f 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -3378,15 +3378,23 @@ static void perf_event_task_output(struct perf_event *event,
 				     struct perf_task_event *task_event)
 {
 	struct perf_output_handle handle;
-	int size;
 	struct task_struct *task = task_event->task;
-	int ret;
+	unsigned long flags;
+	int size, ret;
+
+	/*
+	 * If this CPU attempts to acquire an rq lock held by a CPU spinning
+	 * in perf_output_lock() from interrupt context, it's game over.
+	 */
+	local_irq_save(flags);
 
 	size  = task_event->event_id.header.size;
 	ret = perf_output_begin(&handle, event, size, 0, 0);
 
-	if (ret)
+	if (ret) {
+		local_irq_restore(flags);
 		return;
+	}
 
 	task_event->event_id.pid = perf_event_pid(event, task);
 	task_event->event_id.ppid = perf_event_pid(event, current);
@@ -3397,6 +3405,7 @@ static void perf_event_task_output(struct perf_event *event,
 	perf_output_put(&handle, task_event->event_id);
 
 	perf_output_end(&handle);
+	local_irq_restore(flags);
 }
 
 static int perf_event_task_match(struct perf_event *event)


--

From: Frederic Weisbecker
Date: Friday, March 26, 2010 - 10:23 am

Aah.

So the scenario is the following inversion?

	CPU0				CPU1
sched event with rq->lock held
				grab handle->data->lock
spin on handle->data->lock
				interrupt
				try to grab rq->lock

--

From: Mike Galbraith
Date: Friday, March 26, 2010 - 12:10 pm

Yeah, handle->data->lock holder dare not try to grab any rq lock because
of sched event with rq->lock held.

	-Mike

--

From: Frederic Weisbecker
Date: Friday, March 26, 2010 - 12:27 pm

But if that happens with perf sched, there is something weird.
perf sched only use sched events, which have interrupt disabled
from the trace event handler, so this is not supposed to happen.

But if there is another kind of event involved, something that has
interrupts enabled, may be some software events, then it may
happen indeed.

--

From: Mike Galbraith
Date: Friday, March 26, 2010 - 1:22 pm

Hm.  Last trace I took is below.

./forkbomb&
[1] 5990
marge:/root/tmp # perf sched record
[  427.931717] BUG: NMI Watchdog detected LOCKUP on CPU1, ip ffffffff810853f7, registers:
[  427.931717] CPU 1 
[  427.931717] Modules linked in: nfsd lockd nfs_acl auth_rpcgss sunrpc snd_pcm_oss snd_mixer_oss exportfs snd_seq snd_seq_device cpufreq_conservative microcode cpufreq_ondemand cpufreq_userspace cpufreq_powersave acpi_cpufreq fuse loop dm_mod snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm firewire_ohci usbmouse firewire_core snd_timer crc_itu_t snd usbhid usb_storage rtc_cmos hid soundcore ohci1394 usb_libusual rtc_core sr_mod thermal snd_page_alloc ieee1394 i2c_i801 button processor rtc_lib cdrom e1000e sg uhci_hcd ehci_hcd sd_mod usbcore edd fan ext3 ext2 mbcache jbd ahci libata scsi_mod
[  427.931717] 
[  427.931717] Pid: 0, comm: swapper Tainted: G        W  2.6.34-smpx #1500 MS-7502/MS-7502
[  427.931717] RIP: 0010:[<ffffffff810853f7>]  [<ffffffff810853f7>] perf_output_lock+0x30/0x5a
[  427.931717] RSP: 0018:ffff880002083910  EFLAGS: 00000082
[  427.931717] RAX: 0000000000000002 RBX: ffff8800020839d0 RCX: ffff8800bc91c000
[  427.931717] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800020839d0
[  427.931717] RBP: ffff880002083920 R08: 0000000000000001 R09: 0000000000000001
[  427.931717] R10: 0000000000000000 R11: ffff8800bf845940 R12: ffff8800bc91c000
[  427.931717] R13: ffff8800bc36a400 R14: 0000000000000060 R15: 0000000000000001
[  427.931717] FS:  0000000000000000(0000) GS:ffff880002080000(0000) knlGS:0000000000000000
[  427.931717] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  427.931717] CR2: 00007ff8e6f464a8 CR3: 000000000146e000 CR4: 00000000000006e0
[  427.931717] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  427.931717] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  427.931717] Process swapper (pid: 0, threadinfo ffff8800bf888000, task ffff8800bf845940)
[  427.931717] Stack:
[  427.931717] ...
From: Frederic Weisbecker
Date: Friday, March 26, 2010 - 2:41 pm

Hmm, Yeah that's indeed what the traces show.
And your patch really fix this on your box?

--

From: Frederic Weisbecker
Date: Friday, March 26, 2010 - 2:45 pm

Aaah, I see now. That happens during a fork event!
So that looks indeed the right fix.

--

From: Li Zefan
Date: Sunday, March 28, 2010 - 7:01 pm

From: tip-bot for Mike Galbraith
Date: Friday, April 2, 2010 - 12:07 pm

Commit-ID:  8bb39f9aa068262732fe44b965d7a6eb5a5a7d67
Gitweb:     http://git.kernel.org/tip/8bb39f9aa068262732fe44b965d7a6eb5a5a7d67
Author:     Mike Galbraith <efault@gmx.de>
AuthorDate: Fri, 26 Mar 2010 11:11:33 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 2 Apr 2010 19:30:05 +0200

perf: Fix 'perf sched record' deadlock

perf sched record can deadlock a box should the holder of
handle->data->lock take an interrupt, and then attempt to
acquire an rq lock held by a CPU trying to acquire the
same lock. Disable interrupts.

   CPU0                            CPU1
   sched event with rq->lock held
                                   grab handle->data->lock
   spin on handle->data->lock
                                   interrupt
                                   try to grab rq->lock

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1269598293.6174.8.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/perf_event.c |   15 ++++++++++++---
 1 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index b0feb47..96aae13 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -3376,15 +3376,23 @@ static void perf_event_task_output(struct perf_event *event,
 				     struct perf_task_event *task_event)
 {
 	struct perf_output_handle handle;
-	int size;
 	struct task_struct *task = task_event->task;
-	int ret;
+	unsigned long flags;
+	int size, ret;
+
+	/*
+	 * If this CPU attempts to acquire an rq lock held by a CPU spinning
+	 * in perf_output_lock() from interrupt context, it's game over.
+	 */
+	local_irq_save(flags);
 
 	size  = task_event->event_id.header.size;
 	ret = perf_output_begin(&handle, event, size, 0, 0);
 ...
Previous thread: [RFC] Input: implement sysrq as an input handler by Dmitry Torokhov on Tuesday, March 16, 2010 - 11:19 pm. (13 messages)

Next thread: [PATCH v2] KVM: cleanup {kvm_vm_ioctl, kvm}_get_dirty_log() by Xiao Guangrong on Wednesday, March 17, 2010 - 12:49 am. (6 messages)