Re: [PATCH] perf, x86: try to handle unknown nmis with running perfctrs

Previous thread: Re: [tip:x86/cpu] x86, cpu: RDC doesn't have CPUID, which is what c_ident is by bifferos on Wednesday, August 4, 2010 - 2:12 am. (3 messages)

Next thread: RE: [PATCH 1/2] mtdpart: memory accessor interface for MTD layer by David Brownell on Wednesday, August 4, 2010 - 3:31 am. (3 messages)
From: Lin Ming
Date: Wednesday, August 4, 2010 - 2:21 am

With nmi_watchdog enabled, perf_event_nmi_handler always return
NOTIFY_STOP(active_events > 0), and the notifier call chain will not
call further.

If it was not perf NMI, does the perf nmi handler may stop the real NMI
handler get called because NOTIFY_STOP is returned??

static int __kprobes
perf_event_nmi_handler(struct notifier_block *self,
			 unsigned long cmd, void *__args)
{
	struct die_args *args = __args;
	struct pt_regs *regs;

	if (!atomic_read(&active_events)) ===> With nmi_watchdog enabled, active_events > 0
		return NOTIFY_DONE;

	switch (cmd) {
	case DIE_NMI:
	case DIE_NMI_IPI:
		break;

	default:
		return NOTIFY_DONE;
	}

	regs = args->regs;

	apic_write(APIC_LVTPC, APIC_DM_NMI);
	/*
	 * Can't rely on the handled return value to say it was our NMI, two
	 * events could trigger 'simultaneously' raising two back-to-back NMIs.
	 *
	 * If the first NMI handles both, the latter will be empty and daze
	 * the CPU.
	 */
	x86_pmu.handle_irq(regs);

	return NOTIFY_STOP;
}

Thanks,
Lin Ming


--

From: Peter Zijlstra
Date: Wednesday, August 4, 2010 - 2:50 am

Urgh,.. right, so what is the alternative? we don't seem to have a
reliable way of telling where the NMI originated from.

As that comment says, the PMU can raise the NMI and raise the pending
NMI latch for a second over-run, at which point the first NMI will
likely see the overflow status for both, clear both, and the second NMI
will see a 0 overflow status, return it wasn't the PMU, but since the
PMU did raise it, nobody else will claim it, and we get these silly
dazed and confused thingies.

What NMI source are you concerned about and can it reliably tell if it
raised the NMI or not?


--

From: Robert Richter
Date: Wednesday, August 4, 2010 - 3:01 am

There is no general mechanism for recording the NMI source (except if
it was external triggered, e.g. by the southbridge). Also, all nmis
are mapped to NMI vector 2 and therefore there is no way to find out
the reason by using apic mask registers.

Now, if multiple perfctrs trigger an nmi, it may happen that a handler
has nothing to do because the counter was already handled by the
previous one. Thus, it is valid to have unhandled nmis caused by
perfctrs.

So, with counters enabled we always have to return stop for *all* nmis
as we cannot detect that it was an perfctr nmi. Otherwise we could
trigger an unhandled nmi. To ensure that all other nmi handlers are
called, the perfctr's nmi handler must have the lowest priority. Then,
the handler will be the last in the chain.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Peter Zijlstra
Date: Wednesday, August 4, 2010 - 3:24 am

Well, unless another NMI handler has the exact same issue and also
starts eating all NMIs, just in case.
--

From: Robert Richter
Date: Wednesday, August 4, 2010 - 3:29 am

In this case we will have to change the implementation for unhandled
nmis. But I don't know of other sources with this issue.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Don Zickus
Date: Wednesday, August 4, 2010 - 7:00 am

This is no different than a shared interrupt, no?  All the nmi handlers
need to check their own sources to see if they triggered it.  You can't

But the cases this break are, external NMI buttons, broken firmware that
causes SERRs on the PCI bus, and any other general hardware failures.

So what the perf handler does is really unacceptable.  The only reason we
are noticing this now is because I put the nmi_watchdog on top of the perf
subsystem, so it always has a user and will trigger NOTIFY_STOP.  Before,
it never had a registerd user so instead returned NOTIFY_DONE and
everything worked great.

Cheers,
Don
--

From: Peter Zijlstra
Date: Wednesday, August 4, 2010 - 7:11 am

Right so I looked up your thing and while that limits the damage in that
at some point it will let NMIs pass, it will still consume too many.
Meaning that Yinghai will have to potentially press his NMI button
several times before it registers.
--

From: Don Zickus
Date: Wednesday, August 4, 2010 - 7:52 am

Right, but that is because there is no bit that says the PMU generated the
nmi.  But for the most part checking to see if the PMU is >0 is good

Absolutely.  When a customer complains they upgraded their RHEL kernel and
the box suddenly hangs on boot trying to access the storage device, yes I
care.  Because a flood of NMIs would indiciate something is fishy with
the firmware (in this case it was a network card though it hung on storage
access).  Swallowing the NMIs would just cause everyone to waste weeks of
their time trying to figure it out (you don't want to know how many weeks
were wasted in RHEL-6 across multiple machines only to find out it was
broken firmware on a card that no one suspected as being the culprit).

As much as I hate broken firmware, it is becoming common place, and the
faster the kernel can point it out through unknown nmis, the faster we can

Ok.  Thanks for reviewing.  How does it consume to many?  I probably don't
understand how perf is being used in the non-simple scenarios.

Cheers,
Don

--

From: Peter Zijlstra
Date: Wednesday, August 4, 2010 - 8:02 am

Suppose you have 4 counters (AMD, intel-nhm+), when more than 2 overflow
the first will raise the PMI, if the other 2+ overflow before we disable
the PMU it will try to raise 2+ more PMIs, but because hardware only has
a single interrupt pending bit it will at most cause a single extra
interrupt after we finish servicing the first one.

So then the first interrupt will see 3+ overflows, return 3+, and will
thus eat 2+ NMIs, only one of which will be the pending interrupt,
leaving 1+ NMIs from other sources to consume unhandled.

In which case Yinghai will have to press his NMI button 2+ times before
it registers.

That said, that might be a better situation than always consuming
unknown NMIs.. 
--

From: Cyrill Gorcunov
Date: Wednesday, August 4, 2010 - 8:18 am

Well, first I guess having Yinghai CC'ed is a bonus ;)
The second thing is that I don't get why perf handler can't be _last_
call in default_do_nmi, if there were any nmi with reason (serr or parity)
I think they should be calling first which of course don't eliminate
the former issue but somewhat make it weaken.

	-- Cyrill
--

From: Don Zickus
Date: Wednesday, August 4, 2010 - 8:50 am

Because the reason registers are never set.  If they were, then the code
wouldn't have to walk the notify_chain. :-)

Unknown nmis are unknown nmis, nobody is claiming them.  Even worse, there
are customers that want to register their nmi handler below the perf
handler to claim all the unknown nmis, so they can be logged on the system
before being rebooted.

Cheers,
Don
--

From: Cyrill Gorcunov
Date: Wednesday, August 4, 2010 - 9:10 am

On Wed, Aug 04, 2010 at 11:50:02AM -0400, Don Zickus wrote:

maybe we're talking about different things. i meant that if there is nmi
with a reason (from 0x61) the handling of such nmi should be done before

well, perhaps we might need some kind of perf_chain in notifier code and
call for it after die_nmi chain, so the customers you mention may add own
	-- Cyrill
--

From: Don Zickus
Date: Wednesday, August 4, 2010 - 9:20 am

No we are talking about the same thing. :-)  And that code is already
there.  The problem is the bits in register 0x61 are not always set
correctly in the case of SERRs (well at least in all the cases I have
dealt with).  So you can easily can a flood of unknown nmis from an SERR
and register 0x61 would have the PERR/SERR bits set to 0.  Fun, huh?

Cheers,
Don

--

From: Cyrill Gorcunov
Date: Wednesday, August 4, 2010 - 9:39 am

On Wed, Aug 04, 2010 at 12:20:26PM -0400, Don Zickus wrote:


if there is nothing in nmi_sc the code flows into another branch. And
it hits the problem of perf events eating all nmi giving no chance the
others. So we take if (!(reason & 0xc0)) case and hit DIE_NMI_IPI
(/me scratching the head why it's not under CONFIG_X86_LOCAL_APIC) and
	-- Cyrill
--

From: Robert Richter
Date: Wednesday, August 4, 2010 - 11:48 am

(cc'ing Andi)


Only the upper 2 bits in io_61h indicate the nmi reason, so in case of
(!(reason & 0xc0)) the source simply can not be determined and all nmi
handlers in the chain must be called (DIE_NMI/DIE_NMI_IPI). The
perfctr handler then stops it.

So you can decide to either get an unrecovered nmi panic triggered by
a perfctr or losing unknown nmis from other sources. Maybe this can be
fixed by implementing handlers for those sources.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Andi Kleen
Date: Wednesday, August 4, 2010 - 12:22 pm

This is a tricky area. Me and Ying have been looking at this recently.

Hardware traditionally signals NMI when it has a uncontained error and really 
expects the OS to shut down to prevent data corruption spreading. i

Unfortunately especially for some older hardware
there can be cases where this is not expressed in port 61.
But the default behaviour of Linux for this today is quite wrong.

Some cases can be also determined with the help of APEI, which
can give you more information about the error (and tell you
if shutdown is needed).

But of course we can still have performance counter and other NMI
users.

So the right flow might be something like

- check software events (like crash dump or reboot)
- check perfctrs
- check APEI
- check port 61 for known events (it's probably a good idea
to check perfctrs first because accessing io ports is quite slow.
But the perfctr handler has to make sure it doesn't eat unknown
events, otherwise error handling would be impacted)
- check other event sources
- shutdown (depending on the chipset likely)

This means the NMI users who cannot determine themselves if a event
happened and eat everything (like oprofile today) would need to be fixed.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Cyrill Gorcunov
Date: Wednesday, August 4, 2010 - 12:26 pm

yes, that is what I meant by nmi_sc register. I think we need to restucturize
current default_do_nmi handler but how to be with perfs I don't know at moment
if perf register gets overflowed (ie already has pedning nmi) but we handle
	-- Cyrill
--

From: Robert Richter
Date: Thursday, August 5, 2010 - 11:52 pm

I was playing around with it yesterday trying to fix this. My idea is
to skip an unkown nmi if the privious nmi was a *handled* perfctr
nmi. I will probably post an rfc patch early next week.

Another problem I encountered is that unknown nmis from the chipset
are not reenabled, thus when hitting the nmi button I only see one
unknown nmi message per boot, if I reenable it, I get an nmi
storm firing nmi_watchdog. Uhh....

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Don Zickus
Date: Friday, August 6, 2010 - 7:21 am

You might want to add a little more logic that says *handled* _and_ had
more than one perfctr trigger.  Most of the time only one perfctr is
probably triggering, so you might be eating unknown_nmi's needlessly.


Interesting.

Cheers,
Don
--

From: Robert Richter
Date: Monday, August 9, 2010 - 12:48 pm

Here it comes:

From d2739578199d881ae6a9537c1b96a0efd1cdea43 Mon Sep 17 00:00:00 2001
From: Robert Richter <robert.richter@amd.com>
Date: Thu, 5 Aug 2010 16:19:59 +0200
Subject: [PATCH] perf, x86: try to handle unknown nmis with running perfctrs

When perfctrs are running it is valid to have unhandled nmis, two
events could trigger 'simultaneously' raising two back-to-back
NMIs. If the first NMI handles both, the latter will be empty and daze
the CPU.

The solution to avoid an 'unknown nmi' massage in this case was simply
to stop the nmi handler chain when perfctrs are runnning by stating
the nmi was handled. This has the drawback that a) we can not detect
unknown nmis anymore, and b) subsequent nmi handlers are not called.

This patch addresses this. Now, we drop this unknown NMI only if the
previous NMI was handling a perfctr. Otherwise we pass it and let the
kernel handle the unknown nmi. The check runs only if no nmi handler
could handle the nmi (DIE_NMIUNKNOWN case).

We could improve this further by checking if perf was handling more
than one counter. Otherwise we may pass the unknown nmi too.

Signed-off-by: Robert Richter <robert.richter@amd.com>
---
 arch/x86/kernel/cpu/perf_event.c |   39 +++++++++++++++++++++++++++++--------
 1 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index f2da20f..c3cd159 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1200,12 +1200,16 @@ void perf_events_lapic_init(void)
 	apic_write(APIC_LVTPC, APIC_DM_NMI);
 }
 
+static DEFINE_PER_CPU(unsigned int, perfctr_handled);
+
 static int __kprobes
 perf_event_nmi_handler(struct notifier_block *self,
 			 unsigned long cmd, void *__args)
 {
 	struct die_args *args = __args;
 	struct pt_regs *regs;
+	unsigned int this_nmi;
+	unsigned int prev_nmi;
 
 	if (!atomic_read(&active_events))
 		return NOTIFY_DONE;
@@ -1214,7 +1218,26 @@ ...
From: Cyrill Gorcunov
Date: Monday, August 9, 2010 - 1:02 pm

If only I'm not missing something this apic_write should go up to
"case DIE_NMIUNKNOWN" site, no?

	-- Cyrill
--

From: Robert Richter
Date: Tuesday, August 10, 2010 - 12:42 am

This seems to be code from the non-nmi implementation and can be
removed at all, which should be a separate patch. The vector is
already set up.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Cyrill Gorcunov
Date: Tuesday, August 10, 2010 - 9:16 am

No, this is just a short way to unmask LVTPC (which is required for
cpus). Actually lookig into this snippet I found that in p4 pmu
I've made one redundant unmaksing operation. will update as only
this area settle down.

	-- Cyrill
--

From: Robert Richter
Date: Tuesday, August 10, 2010 - 9:41 am

The vector is setup in hw_perf_enable() and then never masked. The
perfctrs nmi is alwayes enabled since then. I still see no reason for
unmasking it again with every nmi. Once you handle the nmi it is also
enabled.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Cyrill Gorcunov
Date: Tuesday, August 10, 2010 - 10:24 am

It gets masked on NMI arrival, at least for some models (Core Duo, P4,
P6 M and I suspect more theh that, that was the reason oprofile has
	-- Cyrill
--

From: Robert Richter
Date: Tuesday, August 10, 2010 - 12:05 pm

Yes, that's right, I never noticed that. Maybe it is better to
implement the apic write it in model specific code then.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Cyrill Gorcunov
Date: Tuesday, August 10, 2010 - 12:24 pm

Perhaps we can make it simplier I think, ie like it was before -- we just
move it under your new DIE_NMIUNKNOWN, in separate patch of course. Though
I'm fine with either way.

(actually it's interesting to know wouldn't we leave lvt masked when
 we hit 'second delayed nmi has arrived' situation, I guess we didn't
	-- Cyrill
--

From: Robert Richter
Date: Thursday, August 12, 2010 - 6:24 am

I do not understand why you want to put this in the 'unknown'
path. Isn't it necessary to unmask the vector with every call of the
nmi handler?


-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Cyrill Gorcunov
Date: Thursday, August 12, 2010 - 7:31 am

Heh, it's simple - I'm screwed. Robert you're right, of course it should NOT
be under every 'unknown' nmi. I thought about small optimization here, but
I think all this should be done only _after_ your patch is merged.

Sorry for confuse ;)
 
	-- Cyrill
--

From: Don Zickus
Date: Tuesday, August 10, 2010 - 1:48 pm

On top of Robert's patch:
(compiled tested only because I don't have a fancy button to trigger
unknown nmis)

From 548cf5148f47618854a0eff22b1d55db71b6f8fc Mon Sep 17 00:00:00 2001
From: Don Zickus <dzickus@redhat.com>
Date: Tue, 10 Aug 2010 16:40:03 -0400
Subject: [PATCH] perf, x86: only skip NMIs when multiple perfctrs trigger

A small optimization on top of Robert's patch that limits the
skipping of NMI's to cases where we detect multiple perfctr events
have happened.

Signed-off-by: Don Zickus <dzickus@redhat.com>

---
 arch/x86/kernel/cpu/perf_event.c |   34 ++++++++++++++++++++--------------
 1 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index c3cd159..066046d 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1154,7 +1154,7 @@ static int x86_pmu_handle_irq(struct pt_regs *regs)
 		/*
 		 * event overflow
 		 */
-		handled		= 1;
+		handled		+= 1;
 		data.period	= event->hw.last_period;
 
 		if (!x86_perf_event_set_period(event))
@@ -1200,7 +1200,7 @@ void perf_events_lapic_init(void)
 	apic_write(APIC_LVTPC, APIC_DM_NMI);
 }
 
-static DEFINE_PER_CPU(unsigned int, perfctr_handled);
+static DEFINE_PER_CPU(unsigned int, perfctr_skip);
 
 static int __kprobes
 perf_event_nmi_handler(struct notifier_block *self,
@@ -1208,8 +1208,7 @@ perf_event_nmi_handler(struct notifier_block *self,
 {
 	struct die_args *args = __args;
 	struct pt_regs *regs;
-	unsigned int this_nmi;
-	unsigned int prev_nmi;
+	int handled = 0;
 
 	if (!atomic_read(&active_events))
 		return NOTIFY_DONE;
@@ -1229,14 +1228,11 @@ perf_event_nmi_handler(struct notifier_block *self,
 		 * was handling a perfctr. Otherwise we pass it and
 		 * let the kernel handle the unknown nmi.
 		 *
-		 * Note: this could be improved if we drop unknown
-		 * NMIs only if we handled more than one perfctr in
-		 * the previous NMI.
 		 */
-		this_nmi = ...
From: Frederic Weisbecker
Date: Tuesday, August 10, 2010 - 7:44 pm

Yeah, I think that's more reasonable. This lowers even more the chances of
losing important hardware errors.




May be make it just a pending bit. I mean not something that can
go further 1, because you can't have more than 1 pending anyway. I don't
know how that could happen you get accidental perctr_skip > 1, may be
expected pending NMIs that don't happen somehow, but better be paranoid with
that, as it's about trying not to miss hardware errors.

Thanks.

--

From: Robert Richter
Date: Wednesday, August 11, 2010 - 4:10 am

... but this will not work. You have to mark the *absolute* nmi number
here. If you only raise a flag, the next unknown nmi will be dropped,
every. Because, in between there could have been other nmis that
stopped the chain and thus the 'unknown' path is not executed. The
trick in my patch is that you *know*, which nmi you want to skip.

I will send an updated version of my patch.


-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Don Zickus
Date: Wednesday, August 11, 2010 - 5:44 am

I guess I am confused.  The way I read your patch was that you assumed the
next NMI would be the one you skip and if there was another NMI in between
the handled one and the one to skip, you would not skip it (nmi count !=
prev + 1) and it would produce an accidental unknown nmi.

I tried to change that with my patch by setting the skip flag which would
be drained on the next unknown nmi, independent of where it is in the NMI
backlog of NMIs.

Did I misread something?

Cheers,
Don
--

From: Robert Richter
Date: Wednesday, August 11, 2010 - 7:03 am

"setting the skip flag which would be drained on the next unknown nmi"

That's what is wrong, it drops every unknown nmi, no matter when it is
detected. In between there could be 1000's of valid other nmis
handled. You even could have been returned from nmi mode. But still,
the next unknown nmi will be dropped. Your patch could accumulate also
the number of unknown nmis to skip, and then, if 'real' unknown nmis
happen, all of them will be dropped.


-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Don Zickus
Date: Wednesday, August 11, 2010 - 7:32 am

Well as Frederic pointed out the skip variable will never go past one, so

That was the intent.  Can we guarantee that in the rare cases where the
perfctr is generating two nmis, that they will be back-to-back?

I think Huang tried to cap my approach even further my creating a time
window in which the two nmis had to happen.  That gives us the flexibility
to handle nmis that are not back to back, but yet deal with the case where
two perfctrs fired but we are unsure if it generated a second nmi and we
falsely set the skip flag.

Cheers,
Don

--

From: Frederic Weisbecker
Date: Thursday, August 12, 2010 - 9:37 pm

I'm not sure what you mean here. Are you thinking about a third
NMI source that triggers while we are still handling the first


Well with the flag you also know which nmi you want to skip.

--

From: Robert Richter
Date: Friday, August 13, 2010 - 1:22 am

We cannot assume that all cpus have the same behavior here. Imagine a
cpu that handles 2 counters and *does not* trigger a back-to-back
nmi. With flags only implemented, the next unknown nmi will be dropped
anyway, no matter when it fires. We have to check the nmi sequence.

The next thing you have to be aware is, a registered nmi handler is
not called with every nmi. If there was another nmi source and a
handler with higher priority was returning a stop, when all other
subsequent handlers are not called. Esp. 'unknown nmi' is called only
in rare cases. So, a handler might not get noticed of an nmi. This
means, if a handler gets called a 2nd time, it must not necessarily
the next (2nd) nmi.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Frederic Weisbecker
Date: Friday, August 13, 2010 - 6:28 pm

I'd expect it to be an ABI. NMIs can't nest, but if one triggers while
servicing another, it should trigger right after (once we "iret", which
reenables NMIs).



Yeah, in this case we can just clear the __ger_cpu_var(next_nmi_skip)
when another handler service the next NMI.

--

From: Robert Richter
Date: Friday, August 13, 2010 - 7:29 pm

Yes, nmis are nested and if there is another source firing it will be
retriggered. But the question is, if multiple counters trigger, does
this mean multiple nmis are fired, esp. if all counters were served in
the first run? This very much depends on the cpu implementation and it

Yes, this might work too. But then you end up at the same complexity.
Even worse, you have to check and unset the flag with each perf nmi.
If you store the nmi number, it will only be read in then 'unknown'
nmi path again. And, you can't unset the flag for nmis by other
sources what do not pass the perf nmi handler.

So, overall, I don't see advantages in using a flag. The
implementation of storing the nmi number is simple and straight
forward with no or less impact on performance or memory usage.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Don Zickus
Date: Wednesday, August 11, 2010 - 5:39 am

I guess I was thinking about the SMI case where it drains the perfctr(s)
and then retriggers them but I guess even in that case the most you can
have is one extra NMI.  So yeah, you are probably right, I should have
used a flag instead of incrementing.

Cheers,
Don
--

From: Huang Ying
Date: Tuesday, August 10, 2010 - 8:19 pm

You can trigger unknown NMIs via apic->send_IPI_mask(cpu_mask,
NMI_VECTOR).

How about the algorithm as follow:

int perf_event_nmi_handler()
{
	...
	switch (cmd) {
	case DIE_NMIUNKNOWN:
		if (per_cpu(perfctr_prev_handled) > 1
		    && rdtsc() - per_cpu(perfctr_handled_timestamp) < 1000)
			return NOTIFY_STOP;
		else
			return NOTIFY_DONE;
	}
	...
	handled = x86_pmu.handle_irq(regs);
	per_cpu(perfctr_prev_handled) = per_cpu(perfctr_handled);
	per_cpu(perfctr_handled) = handled;
	if (handled) {
		per_cpu(perfctr_handled_timestamp) = rdtsc();
		return NOTIFY_STOP;
	} else
		return NOTIFY_DONE;
}

Best Regards,
Huang Ying


--

From: Don Zickus
Date: Wednesday, August 11, 2010 - 5:36 am

Heh, I thought about the following too, just couldn't figure out an easy
way to timestamp.  I forgot about the rdtsc(). :-)

The only thing that might screw this up would be an SMI which takes longer
than 1000 but that should be rare and would probably be related to the
unknown NMI anyway.  Also under virt that would probably break (due to
time jumping) but until they emulate the perfctr, it won't matter. :-)

Cheers,
Don
--

From: Peter Zijlstra
Date: Monday, August 16, 2010 - 7:37 am

Long running SMIs aren't nearly as rare as you'd want them to be.
Hitting one in exactly the right spot will be, but given the numbers its
going to happen and make us scratch our heads..
--

From: Robert Richter
Date: Wednesday, August 11, 2010 - 3:00 pm

I was debuging this a little more, see version 2 below.

-Robert

--

From 8bb831af56d118b85fc38e0ddc2e516f7504b9fb Mon Sep 17 00:00:00 2001
From: Robert Richter <robert.richter@amd.com>
Date: Thu, 5 Aug 2010 16:19:59 +0200
Subject: [PATCH] perf, x86: try to handle unknown nmis with running perfctrs

When perfctrs are running it is valid to have unhandled nmis, two
events could trigger 'simultaneously' raising two back-to-back
NMIs. If the first NMI handles both, the latter will be empty and daze
the CPU.

The solution to avoid an 'unknown nmi' massage in this case was simply
to stop the nmi handler chain when perfctrs are runnning by stating
the nmi was handled. This has the drawback that a) we can not detect
unknown nmis anymore, and b) subsequent nmi handlers are not called.

This patch addresses this. Now, we check this unknown NMI if it could
be a perfctr back-to-back NMI. Otherwise we pass it and let the kernel
handle the unknown nmi.

This is a debug log:

cpu #6, nmi #32333, skip_nmi #32330, handled = 1, time = 1934364430
cpu #6, nmi #32334, skip_nmi #32330, handled = 1, time = 1934704616
cpu #6, nmi #32335, skip_nmi #32336, handled = 2, time = 1936032320
cpu #6, nmi #32336, skip_nmi #32336, handled = 0, time = 1936034139
cpu #6, nmi #32337, skip_nmi #32336, handled = 1, time = 1936120100
cpu #6, nmi #32338, skip_nmi #32336, handled = 1, time = 1936404607
cpu #6, nmi #32339, skip_nmi #32336, handled = 1, time = 1937983416
cpu #6, nmi #32340, skip_nmi #32341, handled = 2, time = 1938201032
cpu #6, nmi #32341, skip_nmi #32341, handled = 0, time = 1938202830
cpu #6, nmi #32342, skip_nmi #32341, handled = 1, time = 1938443743
cpu #6, nmi #32343, skip_nmi #32341, handled = 1, time = 1939956552
cpu #6, nmi #32344, skip_nmi #32341, handled = 1, time = 1940073224
cpu #6, nmi #32345, skip_nmi #32341, handled = 1, time = 1940485677
cpu #6, nmi #32346, skip_nmi #32347, handled = 2, time = 1941947772
cpu #6, nmi #32347, skip_nmi #32347, handled = 1, time = 1941949818
cpu ...
From: Robert Richter
Date: Thursday, August 12, 2010 - 6:10 am

I was testing the patch further, it properly filters perfctr
back-to-back nmis. I was able to reliable detect unknown nmis
triggered by the nmi button during high load perf sessions with

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Don Zickus
Date: Thursday, August 12, 2010 - 11:21 am

For my own curiousity, what type of high load perf sessions are you using
to test this.  I don't know perf well enough to have it generate events
across multiple perfctrs.

Thanks,
Don
--

From: Robert Richter
Date: Monday, August 16, 2010 - 12:37 am

You put load on all cpus and then start something like the following:

 perf record -e cycles -e instructions -e cache-references \
 	-e cache-misses -e branch-misses -a -- sleep 10

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Don Zickus
Date: Thursday, August 12, 2010 - 6:52 am

Seems to cover my concerns.  Great work!

ACK

Cheers,
Don
--

From: Frederic Weisbecker
Date: Thursday, August 12, 2010 - 9:25 pm

That time based thing looks a bit complicated.

I'm still not sure why you don't want to use a simple flag:

After handled a perf NMI:

	if (handled more than one counter)
		__get_cpu_var(skip_unknown) = 1;

While handling an unknown NMI:

	if (__get_cpu_var(skip_unknown)) {
		__get_cpu_var(skip_unknow) = 0;
		return NOTIFY_STOP;
	}

--

From: Peter Zijlstra
Date: Monday, August 16, 2010 - 7:48 am

I liked the one without funny timestamps in better, the whole timestamps
thing just feels too fragile.

Relying on handled > 1 to arm the back-to-back filter seems doable.

(Also, you didn't deal with the TSC going backwards..)
--

From: Cyrill Gorcunov
Date: Monday, August 16, 2010 - 9:27 am

On Mon, Aug 16, 2010 at 04:48:36PM +0200, Peter Zijlstra wrote:

Me too, the former Roberts patch (if I'm not missing something) looks good

It's doable _but_ I think there is nothing we can do, there is no
way (at least I known of) to check if there is latched nmi from
perf counters. We only can assume that if there multiple counters
overflowed most probably the next unknown nmi has the same nature,
ie it came from perf. Yes, we can loose real unknown nmi in this
case but I think this is justified trade off. If an user need
a precise counting of unknown nmis he should not arm perf events
at all, if there an user with nmi button (guys where did you get this
magic buttuns? i need one ;) he better to not arm perf events too
otherwise he might have to click twice

(and of course we should keep in mind Andi's proposal but it
	-- Cyrill
--

From: Robert Richter
Date: Monday, August 16, 2010 - 10:16 am

Peter, I will rip out the timestamp code from the -v2 patch. My first
patch does not deal with a 2-1-0 sequence, so it has false positives.
We do not necessarily need the timestamps if back-to-back nmis are
rare. Without using timestamps the statistically lost ratio for
unknown nmis will be as the ratio for back-to-back nmis, with
timestamps we could catch almost every unknown nmi. So if we encounter

As said, I think with timestamps we could be able to detect 100% of
the unknown nmis. I guess we get now more than 90% with mutliple
counters, and 100% with a single counter running. So, this is already

Yes, this patch is the first step, now we can change the nmi handler

Does this also happen in the case of a back-to-back nmi? I don't know
the conditions for a backward running TSC. Maybe, if an nmi is
retriggered the TSC wont be adjusted by a negative offset, I don't
know...

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Cyrill Gorcunov
Date: Monday, August 16, 2010 - 12:06 pm

Robert, I think we still may miss unknown irq, consider the case when
unknown nmi is latched while you handle nmi from perf and what is
more interesting several counters may be overflowed. So you set
delta small enough and second (unknown nmi) will be in range and

	-- Cyrill
--

From: Peter Zijlstra
Date: Monday, August 16, 2010 - 12:13 pm

Its not supposed to happen, but then there's BIOS failure-add that frobs
the TSC from SMIs and fun TSC artifacts around CPU frequency changes and
people resetting TSC in S-states etc..

In short, never trust the TSC to be even remotely sane.
--

From: Cyrill Gorcunov
Date: Monday, August 16, 2010 - 12:18 pm

ok, good to know, thanks!

	-- Cyrill
--

From: Robert Richter
Date: Monday, August 16, 2010 - 3:55 pm

Yes, that's true, but before you have to enable your Infinite
Improbability Drive.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Cyrill Gorcunov
Date: Tuesday, August 17, 2010 - 8:23 am

-- Cyrill
--

From: Robert Richter
Date: Tuesday, August 17, 2010 - 8:22 am

Peter,

this is version 3 without timestamp code. Compared to -v1 it
implements the handling for 2-1-0 nmi sequences.

-Robert

From 71a206394d8a3536b033247ec14ad037fef77236 Mon Sep 17 00:00:00 2001
From: Robert Richter <robert.richter@amd.com>
Date: Tue, 17 Aug 2010 16:42:03 +0200
Subject: [PATCH] perf, x86: try to handle unknown nmis with running perfctrs

When perfctrs are running it is valid to have unhandled nmis, two
events could trigger 'simultaneously' raising two back-to-back
NMIs. If the first NMI handles both, the latter will be empty and daze
the CPU.

The solution to avoid an 'unknown nmi' massage in this case was simply
to stop the nmi handler chain when perfctrs are runnning by stating
the nmi was handled. This has the drawback that a) we can not detect
unknown nmis anymore, and b) subsequent nmi handlers are not called.

This patch addresses this. Now, we check this unknown NMI if it could
be a perfctr back-to-back NMI. Otherwise we pass it and let the kernel
handle the unknown nmi.

This is a debug log:

cpu #6, nmi #32333, skip_nmi #32330, handled = 1, time = 1934364430
cpu #6, nmi #32334, skip_nmi #32330, handled = 1, time = 1934704616
cpu #6, nmi #32335, skip_nmi #32336, handled = 2, time = 1936032320
cpu #6, nmi #32336, skip_nmi #32336, handled = 0, time = 1936034139
cpu #6, nmi #32337, skip_nmi #32336, handled = 1, time = 1936120100
cpu #6, nmi #32338, skip_nmi #32336, handled = 1, time = 1936404607
cpu #6, nmi #32339, skip_nmi #32336, handled = 1, time = 1937983416
cpu #6, nmi #32340, skip_nmi #32341, handled = 2, time = 1938201032
cpu #6, nmi #32341, skip_nmi #32341, handled = 0, time = 1938202830
cpu #6, nmi #32342, skip_nmi #32341, handled = 1, time = 1938443743
cpu #6, nmi #32343, skip_nmi #32341, handled = 1, time = 1939956552
cpu #6, nmi #32344, skip_nmi #32341, handled = 1, time = 1940073224
cpu #6, nmi #32345, skip_nmi #32341, handled = 1, time = 1940485677
cpu #6, nmi #32346, skip_nmi #32347, handled = 2, time = 1941947772
cpu #6, nmi ...
From: Cyrill Gorcunov
Date: Tuesday, August 17, 2010 - 9:17 am

...

Looks good to me, thanks a lot Robert!

	-- Cyrill
--

From: Peter Zijlstra
Date: Thursday, August 19, 2010 - 3:45 am

I queued it with that part changed to:

+       this_nmi = percpu_read(irq_stat.__nmi_count);
+       if ((handled > 1) ||
+               /* the next nmi could be a back-to-back nmi */
+           ((__get_cpu_var(nmi).marked == this_nmi) &&
+            (__get_cpu_var(nmi).handled > 1))) {
+               /*
+                * We could have two subsequent back-to-back nmis: The
+                * first handles more than one counter, the 2nd
+                * handles only one counter and the 3rd handles no
+                * counter.
+                *
+                * This is the 2nd nmi because the previous was
+                * handling more than one counter. We will mark the
+                * next (3rd) and then drop it if unhandled.
+                */
+               __get_cpu_var(nmi).marked       = this_nmi + 1;
+               __get_cpu_var(nmi).handled      = handled;
+       }

        return NOTIFY_STOP;
 }


--

From: Robert Richter
Date: Thursday, August 19, 2010 - 5:39 am

I am fine with this. Thanks Peter.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: Don Zickus
Date: Thursday, August 19, 2010 - 7:12 am

I realized the other day this change doesn't cover the nehalem, core and p4
cases which use

intel_pmu_handle_irq
p4_pmu_handle_irq

as their handlers.  Though that patch can go on top of Robert's.

Cheers,
Don
--

From: Peter Zijlstra
Date: Thursday, August 19, 2010 - 7:27 am

Something like this?

---
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
@@ -713,6 +713,7 @@ static int intel_pmu_handle_irq(struct p
 	struct cpu_hw_events *cpuc;
 	int bit, loops;
 	u64 ack, status;
+	int handled = 0;
 
 	perf_sample_data_init(&data, 0);
 
@@ -743,12 +744,16 @@ again:
 	/*
 	 * PEBS overflow sets bit 62 in the global status register
 	 */
-	if (__test_and_clear_bit(62, (unsigned long *)&status))
+	if (__test_and_clear_bit(62, (unsigned long *)&status)) {
+		handled++;
 		x86_pmu.drain_pebs(regs);
+	}
 
 	for_each_set_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
 		struct perf_event *event = cpuc->events[bit];
 
+		handled++;
+
 		if (!test_bit(bit, cpuc->active_mask))
 			continue;
 
@@ -772,7 +777,7 @@ again:
 
 done:
 	intel_pmu_enable_all(0);
-	return 1;
+	return handled;
 }
 
 static struct event_constraint *
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_p4.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_p4.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_p4.c
@@ -673,7 +673,7 @@ static int p4_pmu_handle_irq(struct pt_r
 		if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
 			continue;
 
-		handled += overflow;
+		handled += !!overflow;
 
 		/* event overflow for sure */
 		data.period = event->hw.last_period;
@@ -690,7 +690,7 @@ static int p4_pmu_handle_irq(struct pt_r
 		inc_irq_stat(apic_perf_irqs);
 	}
 
-	return handled > 0;
+	return handled;
 }
 
 /*

--

From: Don Zickus
Date: Thursday, August 19, 2010 - 8:20 am

Looks correct.  I'll try to test the perf_event_intel.c path though I
haven't had much luck getting multiple events on the nehalem box I have
(the amd box was easy).

Cheers,
Don
--

From: Cyrill Gorcunov
Date: Thursday, August 19, 2010 - 10:43 am

On Thu, Aug 19, 2010 at 04:27:13PM +0200, Peter Zijlstra wrote:

No need to !! here, overflowed returns [0;1] though a small nit
	-- Cyrill
--

From: Peter Zijlstra
Date: Thursday, August 19, 2010 - 10:53 am

done.
--

From: Don Zickus
Date: Thursday, August 19, 2010 - 2:58 pm

Sorry I didn't notice this earlier, but I think you want that handled++

Cheers,
Don
--

From: Peter Zijlstra
Date: Friday, August 20, 2010 - 1:50 am

Only if there's any remaining set bits in the overflow status reg,
right?

But if we want to be paranoid, we should also check if the event that
generated the overflow actually had the INT flag enabled, I guess ;-)
--

From: Don Zickus
Date: Thursday, August 19, 2010 - 6:50 pm

I tested this patch and Robert's on an AMD box and Nehalem box.  Both
worked as intended.  However I did notice that whenever the AMD box
detected handled >1, it was shortly followed by an unknown_nmi that was
properly eaten with Robert's logic.  Whereas on the Nehalem box I saw a
lot of 'handled > 1' messages but very very few of them were followed by
an unknown_nmi message (and those messages that did come were properly
eaten).

Maybe that is just the differences in the cpu designs.

Of course I had to make the one change I mentioned previously for the
perf_event_intel.c file (moving the handled++ logic down a few lines).

I didn't run the test on a P4 box.

Looks great, thanks guys!

Cheers,
--

From: Ingo Molnar
Date: Friday, August 20, 2010 - 1:16 am

Please someone send the final version with a changelog, with all the acks and 
tested-by's added, so that i can send it Linuswards.

Thanks,

	Ingo
--

From: Peter Zijlstra
Date: Friday, August 20, 2010 - 3:04 am

I've got two patches queued, one from Robert (with a slight
modification, which removes a goto) and one from myself converting the
intel and p4 irq handler.

They're available at:

  programming.kicks-ass.net/sekrit/patches.tar.bz2

I did add Tested-by tags, although I suspect its not the exact same bits
Don tested..
--

From: Cyrill Gorcunov
Date: Friday, August 20, 2010 - 3:30 am

My Ack if needed
--

From: Don Zickus
Date: Friday, August 20, 2010 - 5:39 am

I can re-test those exact patches later today if needed.

Cheers,
Don
--

From: Ingo Molnar
Date: Friday, August 20, 2010 - 6:27 am

Please pick up -tip in an hour or two and holler if something looks wrong.

Thanks,

	Ingo
--

From: Don Zickus
Date: Friday, August 20, 2010 - 6:51 am

Easy enough.

Cheers,
Don
--

From: Ingo Molnar
Date: Friday, August 20, 2010 - 7:17 am

it's not working so well, i'm getting:

 Uhhuh. NMI received for unknown reason 00 on CPU 9.
 Do you have a strange power saving mode enabled?
 Dazed and confused, but trying to continue

on a nehalem box, after a perf top and perf stat run.

Thanks,

	Ingo
--

From: Cyrill Gorcunov
Date: Friday, August 20, 2010 - 1:45 pm

Might not this be the case of AAN93 erratum, which says the following

 | The processor can be configured to issue a PMI (performance monitor interrupt)
 | upon overflow of the IA32_FIXED_CTR0 MSR (309H). A single PMI should be observed on
 | overflow of IA32_FIXED_CTR0, however multiple PMIs are observed when this
 | erratum occurs.

Just a thought.

	-- Cyrill
--

From: Don Zickus
Date: Tuesday, August 24, 2010 - 2:48 pm

After applying the patch below, I ran the following commands on my nehalem
box without reproducing what you are seeing.  Any thing else I can run
that might trigger it? (I also ran them on an amd phenom quad-core box).

I used 2.6.32-rc2 plus Robert's and 2 of PeterZ's patches.

perf top
perf stat -a -e cycles -e instructions -e cache-references -e cache-misses -e branch-misses -- sleep 5
perf record -f -a -e cycles -e instructions -e cache-references -e cache-misses -e branch-misses -- sleep 5

Cheers,
Don


From 198be1044fa603bc9582a5c19134fdf9a433fff0 Mon Sep 17 00:00:00 2001
From: Don Zickus <dzickus@redhat.com>
Date: Tue, 24 Aug 2010 17:43:17 -0400
Subject: [PATCH] [x86] perf: rename nmi variable to avoid clash with entry point

There is already an entry point named .nmi in entry.S and that seems to clash
with the per_cpu variable nmi defined in commit f3a860d8.  Renaming this
variable avoids the namespace collision.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 arch/x86/kernel/cpu/perf_event.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index dd2fceb..2a05ea4 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1205,7 +1205,7 @@ struct pmu_nmi_state {
 	int		handled;
 };
 
-static DEFINE_PER_CPU(struct pmu_nmi_state, nmi);
+static DEFINE_PER_CPU_PAGE_ALIGNED(struct pmu_nmi_state, pmu_nmi);
 
 static int __kprobes
 perf_event_nmi_handler(struct notifier_block *self,
@@ -1224,7 +1224,7 @@ perf_event_nmi_handler(struct notifier_block *self,
 		break;
 	case DIE_NMIUNKNOWN:
 		this_nmi = percpu_read(irq_stat.__nmi_count);
-		if (this_nmi != __get_cpu_var(nmi).marked)
+		if (this_nmi != __get_cpu_var(pmu_nmi).marked)
 			/* let the kernel handle the unknown nmi */
 			return NOTIFY_DONE;
 		/*
@@ -1248,8 +1248,8 @@ perf_event_nmi_handler(struct notifier_block *self,
 	this_nmi = ...
From: Robert Richter
Date: Friday, August 20, 2010 - 1:36 am

Don, thanks for testing this.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center

--

From: tip-bot for Peter Zijlstra
Date: Friday, August 20, 2010 - 7:17 am

Commit-ID:  4a31bebe71ab2e4f6ecd6e5f9f2ac9f0ff38ff76
Gitweb:     http://git.kernel.org/tip/4a31bebe71ab2e4f6ecd6e5f9f2ac9f0ff38ff76
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Thu, 19 Aug 2010 16:28:00 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 20 Aug 2010 15:00:06 +0200

perf, x86: Fix handle_irq return values

Now that we rely on the number of handled overflows, ensure all handle_irq
implementations actually return the right number.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Don Zickus <dzickus@redhat.com>
LKML-Reference: <1282228033.2605.204.camel@laptop>
---
 arch/x86/kernel/cpu/perf_event_intel.c |    9 +++++++--
 arch/x86/kernel/cpu/perf_event_p4.c    |    2 +-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index d8d86d0..4539b4b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -713,6 +713,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 	struct cpu_hw_events *cpuc;
 	int bit, loops;
 	u64 ack, status;
+	int handled = 0;
 
 	perf_sample_data_init(&data, 0);
 
@@ -743,12 +744,16 @@ again:
 	/*
 	 * PEBS overflow sets bit 62 in the global status register
 	 */
-	if (__test_and_clear_bit(62, (unsigned long *)&status))
+	if (__test_and_clear_bit(62, (unsigned long *)&status)) {
+		handled++;
 		x86_pmu.drain_pebs(regs);
+	}
 
 	for_each_set_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
 		struct perf_event *event = cpuc->events[bit];
 
+		handled++;
+
 		if (!test_bit(bit, cpuc->active_mask))
 			continue;
 
@@ -772,7 +777,7 @@ again:
 
 done:
 	intel_pmu_enable_all(0);
-	return 1;
+	return handled;
 }
 
 static struct event_constraint *
diff --git a/arch/x86/kernel/cpu/perf_event_p4.c b/arch/x86/kernel/cpu/perf_event_p4.c
index febb12c..d470c91 100644
--- a/arch/x86/kernel/cpu/perf_event_p4.c
+++ ...
From: tip-bot for Robert Richter
Date: Friday, August 20, 2010 - 7:17 am

Commit-ID:  8e3e42b4f88602bb6e1f0b74afd80ff4703f79ce
Gitweb:     http://git.kernel.org/tip/8e3e42b4f88602bb6e1f0b74afd80ff4703f79ce
Author:     Robert Richter <robert.richter@amd.com>
AuthorDate: Tue, 17 Aug 2010 16:42:03 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 20 Aug 2010 15:00:05 +0200

perf, x86: Try to handle unknown nmis with an enabled PMU

When the PMU is enabled it is valid to have unhandled nmis, two
events could trigger 'simultaneously' raising two back-to-back
NMIs. If the first NMI handles both, the latter will be empty and daze
the CPU.

The solution to avoid an 'unknown nmi' massage in this case was simply
to stop the nmi handler chain when the PMU is enabled by stating
the nmi was handled. This has the drawback that a) we can not detect
unknown nmis anymore, and b) subsequent nmi handlers are not called.

This patch addresses this. Now, we check this unknown NMI if it could
be a PMU back-to-back NMI. Otherwise we pass it and let the kernel
handle the unknown nmi.

This is a debug log:

cpu #6, nmi #32333, skip_nmi #32330, handled = 1, time = 1934364430
cpu #6, nmi #32334, skip_nmi #32330, handled = 1, time = 1934704616
cpu #6, nmi #32335, skip_nmi #32336, handled = 2, time = 1936032320
cpu #6, nmi #32336, skip_nmi #32336, handled = 0, time = 1936034139
cpu #6, nmi #32337, skip_nmi #32336, handled = 1, time = 1936120100
cpu #6, nmi #32338, skip_nmi #32336, handled = 1, time = 1936404607
cpu #6, nmi #32339, skip_nmi #32336, handled = 1, time = 1937983416
cpu #6, nmi #32340, skip_nmi #32341, handled = 2, time = 1938201032
cpu #6, nmi #32341, skip_nmi #32341, handled = 0, time = 1938202830
cpu #6, nmi #32342, skip_nmi #32341, handled = 1, time = 1938443743
cpu #6, nmi #32343, skip_nmi #32341, handled = 1, time = 1939956552
cpu #6, nmi #32344, skip_nmi #32341, handled = 1, time = 1940073224
cpu #6, nmi #32345, skip_nmi #32341, handled = 1, time = 1940485677
cpu #6, nmi #32346, skip_nmi #32347, handled = 2, time = 1941947772
cpu #6, nmi ...
From: Andi Kleen
Date: Friday, August 6, 2010 - 8:35 am

Some of this can be handled by APEI on newer systems (if the platform
supports that).

But not all unfortunately if you consider older systems.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Don Zickus
Date: Wednesday, August 4, 2010 - 8:45 am

Well if the worse case scenario is only one extra NMI, then I can change
the logic in my patch to eat a max of 1 possible NMI instead of 2 as in the
example you gave above.

It still won't be 100% accurate but how often are people running perf
where they need 4 counters, have to hit an external nmi button or run into
broken firmware all at the same time? :-)

Cheers,
Don

--

From: Andi Kleen
Date: Friday, August 6, 2010 - 8:37 am

One alternative would be to stop using NMIs for perf counters in common cases.

If you have PEBS and your events support PEBS then PEBS can give you a
lot of information inside the irq off region. That works for common
events at least.

Also traditionally interrupt off regions are shrinking in Linux,
so is it really still worth all the trouble just to profile inside them.

e.g. one could make nmi profiling an option with default off.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Don Zickus
Date: Wednesday, August 4, 2010 - 6:54 am

Yes I sent a cheap and dirty patch to address this a couple of weeks ago

http://lkml.indiana.edu/hypermail//linux/kernel/1007.2/02590.html

Unfortunately, no responded. :-(  Of course, it could have been so gross
no one wanted to comment on it. :-)

Cheers,
Don
--

Previous thread: Re: [tip:x86/cpu] x86, cpu: RDC doesn't have CPUID, which is what c_ident is by bifferos on Wednesday, August 4, 2010 - 2:12 am. (3 messages)

Next thread: RE: [PATCH 1/2] mtdpart: memory accessor interface for MTD layer by David Brownell on Wednesday, August 4, 2010 - 3:31 am. (3 messages)