I'll test tip later today to see if I can reproduce it. Cheers, Don
FYI, it does not trigger on an AMD box. Thanks, Ingo --
Ok, to not hold up the perf/urgent flow i zapped these two commits for the time being: 4a31beb: perf, x86: Fix handle_irq return values 8e3e42b: perf, x86: Try to handle unknown nmis with an enabled PMU We can apply them if they take a form that dont introduce a different kind of (and more visible) regression. Thanks, Ingo --
Btw, guys, I fail to see how new nmi_watchdog work, we have
default_do_nmi
if (!(reason & 0xc0)) {
if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT) == == NOTIFY_STOP)
return
if (nmi_watchdog_tick(regs, reason))
return
but perf_event_nmi_handler returns NOTIFY_STOP when watchdog is perf event
and nmi_watchdog_tick _never_ called, or (most probably) I miss something?
-- Cyrill
--
The watchdog is disabled during profiling (perf and oprofile) by calling disable_lapic_nmi_watchdog(). -Robert -- Advanced Micro Devices, Inc. Operating System Research Center --
Huh? iirc Don have switched nmi watchdog to native perf subsystem, ie watchdog uses PERF_COUNT_HW_CPU_CYCLES event, letme check... -- Cyrill --
False alarm, perf watchdog uses own handler. Sorry for noise ;) -- Cyrill --
Ingo, are those commits in some branch to fetch them for testing? I did not find them in tip/master. Thanks, -Robert -- Advanced Micro Devices, Inc. Operating System Research Center --
Robert, they should be in -tip tree, I have
| commit 4a31bebe71ab2e4f6ecd6e5f9f2ac9f0ff38ff76
| Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
| Date: Thu Aug 19 16:28:00 2010 +0200
|
| perf, x86: Fix handle_irq return values
-- Cyrill
--
Has anyone else seen kernel panics when enabling the LOCKUP_DETECTOR or apply Robert's patch? I keep getting a panic with unable to handle paging request everytime either Robert's patch or the lockup detector uses __get_cpu_var(). And only on 2.6.36 code. Cheers, Don --
I use 2.6.34 atm. Letme try 2.6.36 (which might require some time to recompile). -- Cyrill --
On Tue, Aug 24, 2010 at 10:54:58PM +0400, Cyrill Gorcunov wrote: Don, for me it fails with somehow unrelated page handling fault - in reiserfs_evict_inode O_o Fails in __get_cpu_var I suspect might means a problem either in per-cpu allocator itself or we screw pointer somehow. Weird. -- Cyrill --
I just found out (with the help of the crash utility and Dave A.) that Robert's percpu struct nmi clashes with the exception entry point .nmi. I only see this problem in 2.6.36, so I am not sure what changed with regards to compiler flags to confuse variables with text segments. But renaming the percpu struct nmi to nmidon fixed the problem for me (I am open to other suggestions :-) ). Regarding your reiserfs, what was the variable's name? Cheers, Don --
yeah, I suspect name clashes here but then I did grep over per-cpu variables in whole kernel and didn't find match so I thought the assumption was wrong, nmi_don_zickus ;) well, I think nmi_pmu or something like that It seems to be different, it's pity that I had only 80x25 vga mode and was unable to snap the whole log. But actually I didn't even check precisely all .config options I had set since I was more interested in early stage where per-cpu access should already happen rather then real init'ed environmen. But I think I'll be -- Cyrill --
There was a missing "return" in the end of reiserfs_evict_inode. It has been fixed there: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f4ae2f... --
On Thu, Aug 26, 2010 at 3:52 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote: Thanks for info, Frederic! --
I was testing the patches also with CONFIG_LOCKUP_DETECTOR=y without Also I was looking at the differences between .35 and tip/perf/urgent and could not find a namespace collision. Hmm... -Robert -- Advanced Micro Devices, Inc. Operating System Research Center --
So this patch fixes it, though I haven't convince myself why (perhaps babysitting my 4 month old isn't helping :-)) The code now enters the loop and reprocesses the new status which properly increments handled to 2 and thus the new logic takes care of it. Cheers, Don diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index 4539b4b..d16ebd8 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -738,6 +738,7 @@ again: inc_irq_stat(apic_perf_irqs); ack = status; + intel_pmu_ack_status(ack); intel_pmu_lbr_read(); @@ -766,8 +767,6 @@ again: x86_pmu_stop(event); } - intel_pmu_ack_status(ack); - /* * Repeat if there is more work to be done: */ --
Yes, not immediately ack'ing the status was suspect to me too. Though it must then be the same counter that retriggers. Or, it is a cpu bug. You could add a debug print of the status register for the case the loop is reentered, would be interesting... Thank for fixing this. -- Advanced Micro Devices, Inc. Operating System Research Center --
It seems to be the same counter. I wonder if the act of processing it 'intel_save_and_restart' cleared the status bit on the perfselX bit but not in the global status register. Thus it triggered again and we accidentally cleared it when we ack'd later in the code. Cheers, Don --
I would slightly change the patch: There is no need for the ack variable anymore, you could directly work with the status. I would call intel_pmu_ack_status() as close as possible after the intel_pmu_get_status(), which is after 'again:'. -- Advanced Micro Devices, Inc. Operating System Research Center --
Yeah, I can do that. The other patch was just a proof of concept to see what others thought. What is funny is that this problem was masked by the perf_event_nmi_handler swallowing all the nmis. I wonder if we were losing events as a result of this bug too because if you think about it, we processed the first event, a second event came in and we accidentally ack'd it, thus dropping it on the floor. Now I wonder how the event was ever reloaded, unless it was by accident because of how the scheduler deals with perf counters (perf_start/stop all the time). Cheers, Don --
Yes, this could be the case, but only for handled counters. So it would be interesting to see for this case the status mask of the The nmi might be queued be the cpu regardless of of the overflow state. I am wondering why this happens at all, because events are disabled by wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0). Hmm, maybe this is exactly the reason because the nmi could fire again after reenabling the counters. Is there a reason for disabling all counters? -- Advanced Micro Devices, Inc. Operating System Research Center --
The status masks seem to be identical, 0x1 (and when I forced pmc0 Heh. Not sure why it isn't working then. Then again you shouldn't need It would be a nice to have that way we wouldn't have to 'eat' all these extra nmis. But I guess it isn't working correctly. Cheers, Don --
So this should also happen if only one counter is running? Back-to-back nmis actually only occur then 2 different counters What about the erratum mentioned in this thread before? We might identify affected cpus and return handled=2 for them. This solution will be still better than before. And, all other cpu models have the nmi detection fixed. In a next step we could try to use a timer for detection. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center --
Here is another version of the patch, which uses Robert's suggestion of
removing the ack variable
Cheers,
Don
--
From: Don Zickus <dzickus@redhat.com>
Date: Fri, 27 Aug 2010 14:43:03 -0400
Subject: [PATCH] [x86] perf: fix accidentally ack'ing a second event on intel perf counter
During testing of a patch to stop having the perf subsytem swallow nmis,
it was uncovered that Nehalem boxes were randomly getting unknown nmis
when using the perf tool.
Moving the ack'ing of the PMI closer to when we get the status allows
the hardware to properly re-set the PMU bit signaling another PMI was
triggered during the processing of the first PMI. This allows the new
logic for dealing with the shortcomings of multiple PMIs to handle the
extra NMI by 'eat'ing it later.
Now one can wonder why are we getting a second PMI when we disable all
the PMUs in the begining of the NMI handler to prevent such a case, for
that I do not know. But I know the fix below helps deal with this quirk.
Tested on multiple Nehalems where the problem was occuring. With the
patch, the code now loops a second time to handle the second PMI (whereas
before it was not).
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
arch/x86/kernel/cpu/perf_event_intel.c | 6 ++----
1 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 4539b4b..ee05c90 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -712,7 +712,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
struct perf_sample_data data;
struct cpu_hw_events *cpuc;
int bit, loops;
- u64 ack, status;
+ u64 status;
int handled = 0;
perf_sample_data_init(&data, 0);
@@ -729,6 +729,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
loops = 0;
again:
+ intel_pmu_ack_status(status);
if (++loops > 100) {
WARN_ONCE(1, "perfevents: irq loop stuck!\n");
...Can you resend all three updated patches? Thanks Yinghai --
Looks good to me. Thanks Don. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center --
Ingo,
do you mean it does not trigger false positives on AMD? Both patches
applied on top of current tip/perf/urgent (c6db67c) are working on the
systems I have.
You might use the debug patch below for diagnostics.
-Robert
--
From 1bbb5aa64e96360529c34a593a072e1a84114f04 Mon Sep 17 00:00:00 2001
From: Robert Richter <robert.richter@amd.com>
Date: Wed, 11 Aug 2010 18:14:00 +0200
Subject: [PATCH] debug
Signed-off-by: Robert Richter <robert.richter@amd.com>
---
arch/x86/kernel/cpu/perf_event.c | 54 ++++++++++++++++++++++++++++++++++++-
1 files changed, 52 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index dd2fceb..059ef09 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1203,10 +1203,43 @@ void perf_events_lapic_init(void)
struct pmu_nmi_state {
unsigned int marked;
int handled;
+ u64 timestamp;
};
static DEFINE_PER_CPU(struct pmu_nmi_state, nmi);
+struct nmi_debug {
+ int cpu;
+ unsigned int this_nmi;
+ unsigned int marked;
+ int handled;
+ u64 timestamp;
+ u64 delta;
+};
+
+static DEFINE_PER_CPU(struct nmi_debug[16], nmi_debug);
+
+static void nmi_handler_debug(void)
+{
+ struct nmi_debug *debug;
+ int i;
+
+ if (!printk_ratelimit())
+ return;
+
+ for (i = 0; i < 16; i++) {
+ debug = &__get_cpu_var(nmi_debug)[i];
+ printk(KERN_EMERG
+ "cpu #%d, nmi #%d, marked #%d, handled = %d, time = %llu, delta = %llu\n",
+ debug->cpu,
+ debug->this_nmi,
+ debug->marked,
+ debug->handled,
+ debug->timestamp,
+ debug->delta);
+ }
+}
+
static int __kprobes
perf_event_nmi_handler(struct notifier_block *self,
unsigned long cmd, void *__args)
@@ -1214,6 +1247,8 @@ perf_event_nmi_handler(struct notifier_block *self,
struct die_args *args = __args;
unsigned int this_nmi;
int handled;
+ struct nmi_debug *debug;
+ u64 timestamp;
if ...Correct, on AMD boxes i do not get this message during the first 'perf Before the two patches i did not get these messages at all, on any of Thanks, will try that and report back. Ingo --
Here's a more detailed description of the regression introduced by: 4a31beb: perf, x86: Fix handle_irq return values 8e3e42b: perf, x86: Try to handle unknown nmis with an enabled PMU Booting into the debug kernel the system boots up fine - no NMI messages, as expected. Then when i start 'perf top' for the first time i get the NMI message with this debug output: cpu #15, nmi #160, marked #0, handled = 1, time = 333392635730, delta = 11238255 cpu #15, nmi #161, marked #0, handled = 1, time = 333403779380, delta = 11143650 cpu #15, nmi #162, marked #0, handled = 1, time = 333415418497, delta = 11639117 cpu #15, nmi #163, marked #0, handled = 1, time = 333415467084, delta = 48587 cpu #15, nmi #164, marked #0, handled = 1, time = 333415501531, delta = 34447 cpu #15, nmi #165, marked #0, handled = 1, time = 333459918106, delta = 44416575 cpu #15, nmi #166, marked #0, handled = 0, time = 333459923167, delta = 1666 cpu #15, nmi #151, marked #0, handled = 1, time = 332978597882, delta = 11447002 cpu #15, nmi #152, marked #0, handled = 1, time = 332978657151, delta = 59269 cpu #15, nmi #153, marked #0, handled = 1, time = 332978667847, delta = 10696 cpu #15, nmi #154, marked #0, handled = 1, time = 333023125757, delta = 44457910 cpu #15, nmi #155, marked #0, handled = 1, time = 333291980833, delta = 268855076 cpu #15, nmi #156, marked #0, handled = 1, time = 333325663125, delta = 33682292 cpu #15, nmi #157, marked #0, handled = 1, time = 333348216481, delta = 22553356 cpu #15, nmi #158, marked #0, handled = 1, time = 333370168887, delta = 21952406 cpu #15, nmi #159, marked #0, handled = 1, time = 333381397475, delta = 11228588 Uhhuh. NMI received for unknown reason 00 on CPU 15. Do you have a strange power saving mode enabled? Dazed and confused, but trying to continue When i start perf top for a second time, no messages are printed at all. The reason is that on one of the CPUs NMIs are 'stuck': NMI: 78164 67099 6342 [*] 65677 66119 63796 65395 ...
So I found a Nehalem box that can reliably reproduce Ingo's problem using something as simple 'perf top'. But like above, I am noticing the samething, an extra NMI(PMI??) that comes out of nowhere. Looking at the data above the delta between nmis is very small compared to the other nmis. It almost suggests that this is an extra PMI. Considering there is already two cpu errata discussing extra PMIs under certain configurations, I wouldn't be surprised if this was a third. Cheers, --
On Wed, Aug 25, 2010 at 04:11:06PM -0400, Don Zickus wrote: Oh. I'm not sure if it would be a good idea at all but maybe we could use kind of Robert's idea about "pmu nmi relaxing time" ie some time slice in which we treat nmi's as being from pmu, but not arbitrary number but equal to the number of PMI turned off. Say we handle NMI and found that 4 events are overflowed, we clear them, arm timer and wait for 3 unknow nmis to happen, if they are not happening during some time period we clear this waitqueue, if they happen or partially happen - we destroy the timer. Ie almost the same as Robert's idea but without tsc? Just a thought. -- Cyrill --
The only problem is only one counter is overflowing in these cases, so we would have to do it all the time, which may not be hard. But I was thinking of something similar. For now, I am trying to force counter0 off, seeing that most of the perf errata on nehalem have been on counter0. Or maybe I can get 'perf top' to use something other than counter0 by running 'perf record' first? Cheers, --
On Wed, Aug 25, 2010 at 05:20:37PM -0400, Don Zickus wrote: Well, I never looked deep inside perf tool code itself. Perhaps the easiest (if you want to disable counter0) is just disable it in kernel and see how it goes, no? -- Cyrill --
This could also be a race in the counter handling code, or we do not proper count the number of handled counters. Maybe 2 counters actually fired but we only noticed one counter and then accidentially cleared the 2nd without processing it. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center --
On Thu, Aug 26, 2010 at 1:00 PM, Robert Richter <robert.richter@amd.com> wrote: Any chance to get it tested on P4 machine since it has a bit different design? Cyrill --
Well P4 uses a different pmu irq handler, so I am not sure it will give us much insight. I haven't noticed this on an AMD or an intel i5 either. Cheers, Don --
Hmm, I take that back. I guess I can reproduce this on my i5 that I had using Ingo's config. Working on Robert's assumption, I added code to perf_event_intel.c that said if handled !=0 just add one to it (IOW always process handled as 0 or something >1). That seems to working good and catches the nmis that Ingo was seeing. I'll keep looking for the race condition to better fix it. Cheers, Don --
Sounds promising, mind to post new inter-diff? Ie what you have changed from Robert's patch. -- Cyrill --
It was really in Peter's patch, just a stupid hack for now. Cheers, Don diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index 4539b4b..9e65a7b 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -777,7 +777,9 @@ again: done: intel_pmu_enable_all(0); - return handled; + if (!handled) + return handled; + return ++handled; } static struct event_constraint * --
ok, it seems it just treat any unknown nmi as being came from PMU, no? -- Cyrill --
Yes, this just throws away all unknown nmis after a perf nmi. It disables unknown nmi detection on this cpu type. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center --
Wouldn't returning 2 be more sensible, then it would only eat a few unknowns after each pmi? (Still assuming you return 0 when there really was nothing to do) --
Yes, this would be the best workaround for cpus where the detection logic does not work properly. But I think Don found a solution already. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center --
Yes, because all were eaten by NOTIFY_STOP, and that's why we also lose NMIs from other sources, like the nmi button. The patches try to fix this. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center --
I understand that, but obviously the patches need to fix the lack of messages without introducing new regressions: such as messages not seen before and CPUs with stuck NMIs. Thanks, Ingo --
Sad to say, that won't happen. Both my amd box and nehalem box have to many issues with your master branch. The amd box BUGs in perf_event_nmi_handler on the new code trying to run 'perf top' arch/x86/kernel/cpu/perf_event.c::perf_event_nmi_handler:1250 ((__get_cpu_var(nmi).marked == this_nmi) && The BUG is attached below. I can't figure out why And my Nehalem box won't even boot with the that kernel, not even to console for some reason. Then bisecting revealed that in 2.6.35 something with LVM changed such that the kernel can't mount my RHEL-6 lvm partitions. So even if I did get that kernel booting it won't mount disks. I'll take this as a sign to quit for now.. and try again on Monday. :-) Cheers, Don ----- amd-ma78gm-01.rhts.eng.bos.redhat.com login: BUG: unable to handle kernel paging request at ffff87ff838a5200 IP: [<ffffffff814a6370>] perf_event_nmi_handler+0xd0/0xe0 PGD 0 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/system/cpu/online CPU 0 Modules linked in: autofs4 sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log ppdev parport_pc parport wmi snd_hda_codec_atihdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr serio_raw edac_core edac_mce_amd sg i2c_piix4 r8169 mii ahci libahci shpchp ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif firewire_ohci firewire_core crc_itu_t ata_generic pata_acpi pata_atiixp floppy radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan] Pid: 1865, comm: perf Not tainted 2.6.36-rc1tipperf-tip+ #28 GA-MA78GM-S2H/GA-MA78GM-S2H RIP: 0010:[<ffffffff814a6370>] [<ffffffff814a6370>] perf_event_nmi_handler+0xd0/0xe0 RSP: 0018:ffff880002407e88 EFLAGS: 00010046 RAX: 0000000000000001 RBX: 000000000000000c RCX: ffffffff814a5200 RDX: ffffffff814a5200 RSI: 0000000000000001 RDI: ffff880002400000 RBP: ffff880002407e98 R08: 0000000000000001 ...
