First kernel:
[ 1.139418] calling init_hw_perf_events+0x0/0xb77 @ 1
[ 1.159111] Performance Events: PEBS fmt1+, Nehalem events, Intel PMU driver.
[ 1.159567] ... version: 3
[ 1.179121] ... bit width: 48
[ 1.179353] ... generic registers: 4
[ 1.179593] ... value mask: 0000ffffffffffff
[ 1.199211] ... max period: 000000007fffffff
[ 1.199554] ... fixed-purpose events: 3
[ 1.219108] ... event mask: 000000070000000f
[ 1.219454] initcall init_hw_perf_events+0x0/0xb77 returned 0 after 11719 usecs
.....
[ 20.220997] checking TSC synchronization [CPU#0 -> CPU#11]: passed.
[ 20.260818] NMI watchdog enabled, takes one hw-pmu counter.
Kexec'ed kernel:
[ 1.169470] calling init_hw_perf_events+0x0/0xb77 @ 1
[ 1.189265] Performance Events: PEBS fmt1+, Nehalem events, Broken PMU hardware detected, software events only.
...
[ 21.010407] NMI watchdog failed to create perf event on cpu14: fffffffffffffffe
caused by:
commit 33c6d6a7ad0ffab9b1b15f8e4107a2af072a05a0
Author: Don Zickus <dzickus@redhat.com>
Date: Mon Nov 22 16:55:23 2010 -0500
x86, perf, nmi: Disable perf if counters are not accessible
In a kvm virt guests, the perf counters are not emulated. Instead they
return zero on a rdmsrl. The perf nmi handler uses the fact that crossing
a zero means the counter overflowed (for those counters that do not have
specific interrupt bits). Therefore on kvm guests, perf will swallow all
NMIs thinking the counters overflowed.
This causes problems for subsystems like kgdb which needs NMIs to do its
magic. This problem was discovered by running kgdb tests.
The solution is to write garbage into a perf counter during the
initialization and hopefully reading back the same number. On kvm
guests, the value will be read back as zero and we disable perf as
a result.
Reported-by: Jason Wessel ...
*sigh*, and people ask me why kexec/kdump are such bad ideas.. apparently kexec doesn't properly shut down the first kernel and leaves a counter running, then when we write and read the counter value they don't match because it's still running and voila, crap happens. I've CC'ed the kexec people, maybe they've got a clue as to how to sort this. --
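To make the failure mode concrete, here is a minimal user-space sketch of the write/read-back probe. It never touches real MSRs: `struct fake_counter`, `fake_wrmsr()` and `fake_rdmsr()` are hypothetical stand-ins for `checking_wrmsrl()` and `rdmsrl_safe()`, and the three modes are assumptions that model a stopped counter, a guest that ignores writes and reads back zero, and a counter left running by the previous kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one performance counter MSR (not kernel API). */
enum counter_mode {
    COUNTER_STOPPED,      /* holds whatever was last written */
    COUNTER_NOT_EMULATED, /* kvm-style: writes ignored, reads return 0 */
    COUNTER_RUNNING       /* left enabled: value advances between accesses */
};

struct fake_counter {
    enum counter_mode mode;
    uint64_t value;
};

static void fake_wrmsr(struct fake_counter *c, uint64_t val)
{
    if (c->mode != COUNTER_NOT_EMULATED)
        c->value = val;
}

static uint64_t fake_rdmsr(struct fake_counter *c)
{
    if (c->mode == COUNTER_NOT_EMULATED)
        return 0;
    if (c->mode == COUNTER_RUNNING)
        c->value += 7; /* arbitrary number of events since the write */
    return c->value;
}

/* Same shape as the probe in the commit: write a pattern, expect it back. */
static bool probe_counter(struct fake_counter *c)
{
    const uint64_t val = 0xabcdULL;

    fake_wrmsr(c, val);
    return fake_rdmsr(c) == val;
}
```

In this model only the stopped counter passes. The unemulated counter and the running counter both fail, so the probe cannot tell a kvm guest apart from a counter left running across kexec.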
So we can shut down the counters while the first kernel is going down. Is there a simple function already which I can call? Thanks Vivek --
Dunno, the cpu hotplug stuff should suffice I think, but then I don't think you actually unplug the boot cpu. What does kexec normally do to ensure hardware is left in a sane state? --
Typically calls device_shutdown() and sysdev_shutdown() from kernel_restart_prepare() to shutdown the devices. Also calls machine_shutdown() which depending on architecture can take care of various things like stopping other cpus, shutting down LAPIC, disabling IOAPIC, disabling hpet, shutting down IOMMU etc (native_machine_shutdown()). Thanks Vivek --
So basically there's no sane generic reset callout? --
I think ->shutdown() calls are sane generic callouts. Aren't they? There seem to be a few exceptions for LAPIC, IOMMU and HPET and I am not sure why they are not covered by shutdown calls. CCing Eric, he might have more insight into it. Thanks Vivek --
->shutdown looks like it's about to reset/halt the hardware, no point in slowing down the regular shutdown/reboot path for something like this. That's all arch specific, but even there I don't think the reset code should live outside of kexec. --
I think we already call ->shutdown() in regular reboot path.
kernel_restart()
kernel_restart_prepare()
device_shutdown();
sysdev_shutdown();
So it should not make lot of difference if perf subsystem/counters are
I would not know the history but I have heard stories that if you don't
shutdown the hardware over restart, BIOS might not be expecting it and
might get trumped.
Thanks
Vivek
--
Oh, but I'm not a device or sysdev thing, I'll never get something like Never yet had a problem with that. --
There is also the reboot notifier, if the NMI needs to be controlled I haven't personally but I have certainly heard stories and seen debugging sessions where some devices work or don't depending on the order of running linux and windows on a machine, with soft reboots in between. Eric --
I tried reboot notifiers with the nmi_watchdog and achieved some success
(on a Westmere box, a P4 still failed). Kdump is still screwed, but maybe
we don't care for now.
Here is the quick and dirty patch I used.
Cheers,
Don
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 792a4ed..3455cf9 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -23,6 +23,7 @@
#include <linux/notifier.h>
#include <linux/module.h>
#include <linux/sysctl.h>
+#include <linux/reboot.h>
#include <asm/irq_regs.h>
#include <linux/perf_event.h>
@@ -550,6 +551,18 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
.notifier_call = cpu_callback
};
+static int __cpuinit
+reboot_callback(struct notifier_block *nfb, unsigned long action, void *unused)
+{
+ watchdog_disable_all_cpus();
+
+ return notifier_from_errno(0);
+}
+
+static struct notifier_block __cpuinitdata reboot_nfb = {
+ .notifier_call = reboot_callback
+};
+
void __init lockup_detector_init(void)
{
void *cpu = (void *)(long)smp_processor_id();
@@ -563,6 +576,7 @@ void __init lockup_detector_init(void)
cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
register_cpu_notifier(&cpu_nfb);
+ register_reboot_notifier(&reboot_nfb);
return;
}
--
We'd really want a perf_event.c callback there to do as the hot-unplug code does and detach all running counters from the cpu. --
Ok, I moved the reboot notifier stuff from kernel/watchdog.c to kernel/perf_event.c. Things still worked fine from a kexec perspective. Vivek suggested to me this morning that I should just blatantly disable the perf counter during init when running my test. Looking through the code I don't think I can do this using disable_all because some routines look for the active bit to be set and some arches have different disable registers than others. Thoughts? Cheers, Don --
Nah, we should actively scan for that during the bring-up and kill
hw-perf when we find an enable bit set, some BIOSes actively use the
PMU, this is something that should be discouraged.
---
arch/x86/kernel/cpu/perf_event.c | 30 +++++++++++++++++++++++++++---
1 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 817d2b1..7f92833 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -375,15 +375,40 @@ static void release_pmc_hardware(void) {}
static bool check_hw_exists(void)
{
u64 val, val_new = 0;
- int ret = 0;
+ int i, reg, ret = 0;
val = 0xabcdUL;
ret |= checking_wrmsrl(x86_pmu.perfctr, val);
ret |= rdmsrl_safe(x86_pmu.perfctr, &val_new);
- if (ret || val != val_new)
+ if (ret || val != val_new) {
+ printk(KERN_CONT "Broken PMU hardware detected, software events only.\n");
return false;
+ }
+
+ /*
+ * Check to see if the BIOS enabled any of the counters, if so
+ * complain and bail.
+ */
+ for (i = 0; i < x86_pmu.num_counters; i++) {
+ reg = x86_pmu.eventsel + i;
+ rdmsrl(reg, val);
+ if (val & ARCH_PERFMON_EVENTSEL_ENABLE)
+ goto bios_fail;
+ }
+
+ for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
+ reg = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
+ rdmsrl(reg, val);
+ if (val & (0x03 << i*4))
+ goto bios_fail;
+ }
return true;
+
+bios_fail:
+ printk(KERN_CONT "Broken BIOS detected, software events only.\n");
+ printk(KERN_ERR FW_BUG "invalid MSR %x=%Lx\n", reg, val);
+ return false;
}
static void reserve_ds_buffers(void);
@@ -1379,7 +1404,6 @@ int __init init_hw_perf_events(void)
/* sanity check that the hardware exists or is emulated */
if (!check_hw_exists()) {
- pr_cont("Broken PMU hardware detected, software events only.\n");
return 0;
Something like the below, preferably I'd key that off of SYS_KEXEC, but
looking through the existing notifiers adding a state requires ...
Ok, the reboot notifier addresses the kexec problem but doesn't fix it though (I have to test to confirm that, comments below). The bios check should catch those situations (ironically I stumbled upon a machine with this problem, so I will test your patch with it, though it only uses perf counter 0). The kdump problem will still exist, not sure if we care and perhaps we should document in the changelog that we know kdump is still
I wonder if you should reverse these checks. If the bios has the perf counter enabled, there might be a high chance that it fails the first
Ok, so this shuts down the perf counters on cpu0, but the other cpus are still running and will fail your new bios check, no? Privately, I used the above wrapped with for_each_online_cpu(cpu) and it worked fine for me. Cheers, Don --
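The enable-bit scan discussed above can be modelled in user space. This is a sketch, not the kernel code: `struct pmu_snapshot` and `pmu_in_use()` are hypothetical names, and the MSR values come from a plain struct rather than `rdmsrl()`. The bit positions follow the patch: bit 22 of each event-select register (`ARCH_PERFMON_EVENTSEL_ENABLE`) and the low two bits of each 4-bit field in the fixed-counter control register.

```c
#include <stdbool.h>
#include <stdint.h>

#define EVENTSEL_ENABLE (1ULL << 22) /* ARCH_PERFMON_EVENTSEL_ENABLE */

/* Hypothetical snapshot of one CPU's PMU control MSRs (not kernel API). */
struct pmu_snapshot {
    int num_counters;        /* generic counters, at most 8 in this sketch */
    int num_counters_fixed;  /* fixed-purpose counters */
    uint64_t eventsel[8];    /* one event-select MSR per generic counter */
    uint64_t fixed_ctr_ctrl; /* one MSR, 4 control bits per fixed counter */
};

/* Return true if any generic or fixed counter is already enabled. */
static bool pmu_in_use(const struct pmu_snapshot *s)
{
    int i;

    for (i = 0; i < s->num_counters; i++)
        if (s->eventsel[i] & EVENTSEL_ENABLE)
            return true;

    /* The low two bits of each 4-bit field enable ring-0/ring-3 counting. */
    for (i = 0; i < s->num_counters_fixed; i++)
        if (s->fixed_ctr_ctrl & (0x3ULL << (i * 4)))
            return true;

    return false;
}
```

A snapshot like this describes a single CPU. As noted in the reply above, counters left running on the other CPUs would need the same treatment before the next kernel scans them.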
Right, they usually only steal one or two counters, but the fact that You mean even if we cure the kexec reboot notifier patch thing kdump is Oh, so reboot doesn't actually stop the non-boot cpus? I was unsure of that (see my XXX there), so yeah, if it doesn't then I guess the for_each_possible_cpu() thing is the way out. --
Yes. Reboot notifier notifications are not sent in the kdump path. In this path we know the kernel has crashed and we just try to do the bare minimum to boot into the second kernel. If some hardware is left in an inconsistent state we try to recover from that situation by resetting the device when the second kernel is booting. Either the driver itself can detect that the device is in an inconsistent state and reset it, or we also pass a command line parameter "reset_devices" to the second kernel to explicitly tell the kernel that devices might be in a bad state and should be reset during initialization. If we want to use these perf counters in the kdump kernel, we shall have to do something similar. Thanks Vivek --
Right, so I'm perfectly fine with leaving the kdump kernel broken for now and if people really do need hardware events we can try and reset the hardware when we find that reset_devices command line parameter. Not sure how that interacts with these broken BIOSes, but it's kdump so it's mostly broken by design anyway ;-) --
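As a rough illustration of keying a PMU reset off "reset_devices", the sketch below checks a command line string for the parameter in plain user-space C. Both `cmdline_has_param()` and `should_reset_pmu()` are hypothetical helpers invented for this example. The kernel itself registers such parameters through its own setup machinery, and nothing in this thread settles the actual reset policy.

```c
#include <stdbool.h>
#include <string.h>

/* Return true if `param` appears as a whole space-separated word in `cmdline`. */
static bool cmdline_has_param(const char *cmdline, const char *param)
{
    size_t len = strlen(param);
    const char *p = cmdline;

    if (len == 0)
        return false;

    while ((p = strstr(p, param)) != NULL) {
        bool starts_word = (p == cmdline) || (p[-1] == ' ');
        bool ends_word = (p[len] == '\0') || (p[len] == ' ');

        if (starts_word && ends_word)
            return true;
        p++; /* keep searching after a partial match */
    }
    return false;
}

/*
 * Hypothetical policy: only take over counters that were found enabled
 * when the user explicitly asked for devices to be reset.
 */
static bool should_reset_pmu(const char *cmdline, bool enable_bit_found)
{
    return enable_bit_found && cmdline_has_param(cmdline, "reset_devices");
}
```

Matching whole words keeps a parameter such as `noreset_devices` from being mistaken for `reset_devices`.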
reset_devices was meant to be dual purpose so that it can handle broken BIOSes also. So if BIOS is broken then one can pass "reset_devices" to Kdump has its share of problems especially with the fact that kernel/drivers find devices in bad state and are not hardened enough to deal with that. But on bare metal what's the better way of capturing kernel crash dump? Trying to do anything post crash in the kernel is not very reliable either. I think the way we fix kernel for boot problems on newer hardware, for broken BIOSes, we need to keep on fixing it in kdump path also to make sure new devices/drivers can cope with this scenario. Thanks Vivek --
/me <3 RS-232 I haven't found anything better than that... And poking at the RS-232 requires less of the kernel to be functional than booting into a new kernel (whose image might have been corrupted by the dying kernel, etc..) --
Serial is good for getting the oops out. But for the big vmcore? Secondly, people want the flexibility of sending the vmcore over various targets like over network to some remote server. Booting into second kernel opens up all those options and now one can do intelligent filtering and send New kernel image being corrupted problem can be solved up to great extent by write protecting that memory location. So those who are happy with RS-232, they don't have to configure kdump. Just connect serial console and get the oops message out. Thanks Vivek --
True. But it can be a pain to operate RS-232 at production scale, or to convince customers to hook up RS-232 just in case your released software For debugging a reproducible failure RS-232 wins. For everything else there is kdump. It sucks but it is at least fixable. And really the kdump kernel should be running a minimalistic hardware config so you only have to get the chunks of hardware you really care about working. As for corruption the kdump kernel lives in an area of memory that we never DMA to in the primary kernel, and we check a sha256 hash before we start booting the kdump kernel. In general kdump fails safe. That is if it can't make things work it fails to boot and does nothing to your system. Definitely not perfect but if you don't have RS-232 it is the best I have seen. Eric --
Something like so..
---
Subject: perf, x86: Detect broken BIOSes
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Dec 08 15:56:23 CET 2010
Some BIOSes use PMU resources, this is a bug.
Try to detect this, warn about it, and further refuse to touch the
PMU ourselves.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -375,15 +375,51 @@ static void release_pmc_hardware(void) {
static bool check_hw_exists(void)
{
u64 val, val_new = 0;
- int ret = 0;
+ int i, reg, ret = 0;
+ /*
+ * Check to see if the BIOS enabled any of the counters, if so
+ * complain and bail.
+ */
+ for (i = 0; i < x86_pmu.num_counters; i++) {
+ reg = x86_pmu.eventsel + i;
+ ret = rdmsrl_safe(reg, &val);
+ if (ret)
+ goto msr_fail;
+ if (val & ARCH_PERFMON_EVENTSEL_ENABLE)
+ goto bios_fail;
+ }
+
+ for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
+ reg = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
+ ret = rdmsrl_safe(reg, &val);
+ if (ret)
+ goto msr_fail;
+ if (val & (0x03 << i*4))
+ goto bios_fail;
+ }
+
+ /*
+ * Now write a value and read it back to see if it matches,
+ * this is needed to detect certain hardware emulators (qemu/kvm)
+ * that don't trap on the MSR access and always return 0s.
+ */
val = 0xabcdUL;
- ret |= checking_wrmsrl(x86_pmu.perfctr, val);
+ ret = checking_wrmsrl(x86_pmu.perfctr, val);
ret |= rdmsrl_safe(x86_pmu.perfctr, &val_new);
if (ret || val != val_new)
- return false;
+ goto msr_fail;
return true;
+
+bios_fail:
+ printk(KERN_CONT "Broken BIOS detected, software events only.\n");
+ printk(KERN_ERR FW_BUG "invalid MSR: %x=%Lx\n", reg, val);
+ return false;
+
+msr_fail:
+ printk(KERN_CONT "Broken PMU hardware detected, software events only.\n");
+ return false;
}
static ...
Can you add something like force_... on the command line to take over ownership of perf from the BIOS or the previous kernel? Then we can still use perf etc. after we kexec from a RHEL or SLES kernel to a later kernel (from 2.6.37). Thanks --
The problem is, you cannot steal the thing from the BIOS, you'll trample on its settings and the next time it runs it will simply re-instate it. And aside from probing the EN bit on boot there is no way of determining this. I'm not sure why people would do that, but yeah I guess we can do something like that. --
One more problem: systems with linuxbios that have a kernel in flash as the bootloader. They may kexec to the final production kernel, and they may need to update that embedded kernel to shut down perf.... Yinghai --
how about second case: kexec from RHEL 6 stock kernel to upstream kernel ? Thanks Yinghai --
It's impossible to distinguish between a BIOS having claimed a counter and a previous kernel not having shut things down properly. The best we can do is allow a force parameter and let the user keep all pieces when he uses it. --
My understanding is that you can't because the BIOS is actively using it behind the scenes of the kernel (well during an SMI). I have a machine where I tried to force take it but it still stopped triggering interrupts. Cheers, Don --
This seems to work correctly on my Nehalem and broken bios machines during boot and kexec. As expected it fails during kdump. My p4 box failed during kexec for some reason. But p4 has other issues. Cheers, --
Does the kdump kernel still boot? It looks like it should I just want to double check. Eric --
Yeah, sorry for not being clear. It definitely boots and does its thing. perf init (and thus nmi watchdog) fail with 'BIOS broken' because the perf counters were not shut down prior to executing kdump. Cheers, Don --
Getting closer... if (x86_pmu.perfctr_second_write) ret |= checking_wrmsrl(x86_pmu.perfctr, val); Cheers, Don --
On Thu, Dec 09, 2010 at 03:20:08PM -0500, Don Zickus wrote: ... yeah, thanks! would you push a patch upstream? Cyrill --
Something like so then?
---
Subject: perf: Stop all counters on reboot
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Dec 08 15:29:02 CET 2010
Use the reboot notifier to detach all running counters on reboot, this
solves a problem with kexec where the new kernel doesn't expect
running counters (rightly so).
It will however decrease the coverage of the NMI watchdog. Making a
kexec specific reboot notifier callback would be best, however that
would require touching all notifier callback handlers as they are not
properly structured to deal with new state.
As a compromise, place the perf reboot notifier at the very last
position in the list.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -21,6 +21,7 @@
#include <linux/dcache.h>
#include <linux/percpu.h>
#include <linux/ptrace.h>
+#include <linux/reboot.h>
#include <linux/vmstat.h>
#include <linux/vmalloc.h>
#include <linux/hardirq.h>
@@ -6329,7 +6330,7 @@ static void __cpuinit perf_event_init_cp
mutex_unlock(&swhash->hlist_mutex);
}
-#ifdef CONFIG_HOTPLUG_CPU
+#if defined CONFIG_HOTPLUG_CPU || defined CONFIG_KEXEC
static void perf_pmu_rotate_stop(struct pmu *pmu)
{
struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
@@ -6383,6 +6384,26 @@ static void perf_event_exit_cpu(int cpu)
static inline void perf_event_exit_cpu(int cpu) { }
#endif
+static int
+perf_reboot(struct notifier_block *notifier, unsigned long val, void *v)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ perf_event_exit_cpu(cpu);
+
+ return NOTIFY_OK;
+}
+
+/*
+ * Run the perf reboot notifier at the very last possible moment so that
+ * the generic watchdog code runs as long as possible.
+ */
+static struct notifier_block perf_reboot_notifier = {
+ .notifier_call = perf_reboot,
+ .priority = ...
Can't think why somebody would like to use performance counters in the kdump kernel. So that probably should not be a concern. Vivek --
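The ordering idea from the changelog above, running the perf callback last so the watchdog stays covered for as long as possible, can be sketched with a toy priority-sorted chain. Everything here is hypothetical and simplified: `struct demo_notifier` is not the kernel's `struct notifier_block`, and the `INT_MIN` priority is only an assumed "lowest possible" value for illustration, since the priority in the patch is elided.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a notifier chain entry. */
struct demo_notifier {
    void (*call)(unsigned long action);
    int priority;               /* higher priority runs earlier */
    struct demo_notifier *next;
};

/* Insert so that higher priorities come first; ties keep registration order. */
static void demo_register(struct demo_notifier **head, struct demo_notifier *nb)
{
    while (*head && (*head)->priority >= nb->priority)
        head = &(*head)->next;
    nb->next = *head;
    *head = nb;
}

/* Call every entry in chain order and return how many callbacks ran. */
static int demo_call_chain(struct demo_notifier *head, unsigned long action)
{
    int ran = 0;

    for (; head; head = head->next, ran++)
        head->call(action);
    return ran;
}

/* Record the call order so the ordering can be inspected afterwards. */
enum { DEMO_WATCHDOG = 1, DEMO_PERF = 2 };
static int demo_order[4];
static int demo_calls;

static void demo_record(int who)
{
    if (demo_calls < 4)
        demo_order[demo_calls++] = who;
}

static void demo_watchdog_reboot(unsigned long action)
{
    (void)action;
    demo_record(DEMO_WATCHDOG);
}

static void demo_perf_reboot(unsigned long action)
{
    (void)action;
    demo_record(DEMO_PERF);
}

static struct demo_notifier watchdog_nb = { demo_watchdog_reboot, 0, NULL };
/* Assumed lowest priority so the perf callback runs after everything else. */
static struct demo_notifier perf_nb = { demo_perf_reboot, INT_MIN, NULL };
```

Registering `perf_nb` before `watchdog_nb` still leaves the perf callback at the tail of the chain, because insertion is ordered by priority and not by registration time.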
Ok, here is a simpler patch for now.
--------------------------------8<--------
From: Don Zickus <dzickus@redhat.com>
Date: Tue, 7 Dec 2010 16:06:59 -0500
Subject: [PATCH] perf: Use event select bits for hardware check
The counter registers can continue to increment if left enabled across a
kexec or a kdump. This makes the perf hardware check accidentally return
false when the hardware really does exist.
Change the check to use the first bits of event selection. Those bits should
be safe as they are used to program the type of events to use. And more
importantly, they won't increment across kexec/kdump.
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
arch/x86/kernel/cpu/perf_event.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 7b91396..7d869c0 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -377,10 +377,10 @@ static bool check_hw_exists(void)
u64 val, val_new = 0;
int ret = 0;
- val = 0xabcdUL;
- ret |= checking_wrmsrl(x86_pmu.perfctr, val);
- ret |= rdmsrl_safe(x86_pmu.perfctr, &val_new);
- if (ret || val != val_new)
+ val = 0xabUL;
+ ret |= checking_wrmsrl(x86_pmu.eventsel, val);
+ ret |= rdmsrl_safe(x86_pmu.eventsel, &val_new);
+ if (ret || val != (val_new & 0xFF))
return false;
return true;
--
1.7.3.2
--
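The difference between probing the counter and probing the event-select bits can be shown with a small user-space model. This is a sketch under stated assumptions: `struct fake_pmc`, `fake_tick()` and the two probe functions are hypothetical, `emulated == false` models a guest that ignores writes and reads back zero, and one "tick" stands for whatever time passes between the write and the read.

```c
#include <stdbool.h>
#include <stdint.h>

#define EVENTSEL_ENABLE (1ULL << 22) /* ARCH_PERFMON_EVENTSEL_ENABLE */

/* Hypothetical model of one counter plus its event-select register. */
struct fake_pmc {
    bool emulated;     /* false: writes are ignored and reads return 0 */
    uint64_t eventsel;
    uint64_t perfctr;
};

/* Time passes: an enabled counter counts, the event select never changes. */
static void fake_tick(struct fake_pmc *p)
{
    if (p->emulated && (p->eventsel & EVENTSEL_ENABLE))
        p->perfctr++;
}

/* Original probe: write the counter and expect the same value back. */
static bool probe_perfctr(struct fake_pmc *p)
{
    const uint64_t val = 0xabcdULL;

    if (p->emulated)
        p->perfctr = val;
    fake_tick(p);
    return (p->emulated ? p->perfctr : 0) == val;
}

/* Patched probe: write the low event-select bits and compare only those. */
static bool probe_eventsel(struct fake_pmc *p)
{
    const uint64_t val = 0xabULL;

    if (p->emulated)
        p->eventsel = val; /* whole-register write, also clears the enable bit */
    fake_tick(p);
    return ((p->emulated ? p->eventsel : 0) & 0xFF) == val;
}
```

With a counter left enabled, `probe_perfctr()` fails in this model while `probe_eventsel()` passes, and both still fail for the unemulated case, which is the behaviour the changelog is after. The model also shows the side effect of the event-select write: it overwrites whatever was programmed there before.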
Won't merge it though, I think it stinks.. --
No you don't! Most BIOSen implement a board level reset there, but it isn't required. Just doing a software only reinitialization is allowed, and on some arches is the only thing you can do. Speed during reboot is not a reason to avoid anything. reboot is not a fast path, and we are talking about things in human terms. The only argument I have heard that holds the least amount of sense is to keep what we do to a minimum, to increase the chances that we can do a reboot even after a kernel oops. All of that said. What insane state are we leaving the hardware in that we think it is going to be slow in human terms to remove? Eric --
