Re: nmi_watchdog fix for x86_64 to be more like i386

Previous thread: Proposed 2.6 Patch for AMD MIPS Alchemy au1550 I2C interface I2C interface by Chris David on Wednesday, September 26, 2007 - 10:48 am. (1 message)

Next thread: forcedeth question by roel on Wednesday, September 26, 2007 - 11:35 am. (1 message)
From: David Bahi
Date: Wednesday, September 26, 2007 - 11:03 am

Thanks to tglx and ghaskins for all the help in tracking down a very
early nmi_watchdog crash on certain x86_64 machines.

From: Andi Kleen
Date: Monday, October 1, 2007 - 10:36 am

The patch is totally bogus. irq 0 doesn't say anything about whether
the current CPU still works or not. You always need some local
interrupt. This basically disables the NMI watchdog for the non boot CPUs.

It's even wrong on i386 -- i wonder how that broken patch
made it in there. I'll remove it there.

-Andi
-

From: Thomas Gleixner
Date: Monday, October 1, 2007 - 11:54 am

Right, it's wrong for the broadcast case, but simply removing it will
trigger false positives on the CPU which runs the broadcast timer. I
fix this proper.

    tglx
-

From: Andi Kleen
Date: Monday, October 1, 2007 - 12:16 pm

I already did this here by checking for cpu != 0. But it also needs either tracking
or forbidding migrations of irq 0. I can take care of the patch.

-Andi 
-

From: Thomas Gleixner
Date: Monday, October 1, 2007 - 12:27 pm

I was thinking about the same fix. On i386 we already have the irq 
migration / balancing of irq 0 disabled. That's why we setup IRQ0 with
IRQ_NOBALANCING.

	tglx




-

From: Arjan van de Ven
Date: Monday, October 1, 2007 - 12:56 pm

btw doing this is a problem if the user decides to hot(un)plug cpu 0...
he then can't move the irqs away to do that
-

From: Dave Jones
Date: Monday, October 1, 2007 - 1:10 pm

On Mon, Oct 01, 2007 at 12:56:26PM -0700, Arjan van de Ven wrote:
 > On Mon, 1 Oct 2007 21:27:39 +0200 (CEST)
 > 
 > > > I already did this here by checking for cpu != 0. But it also needs
 > > > either tracking or forbidding migrations of irq 0. I can take care
 > > > of the patch.
 > > 
 > > I was thinking about the same fix. On i386 we already have the irq 
 > > migration / balancing of irq 0 disabled. That's why we setup IRQ0 with
 > > IRQ_NOBALANCING.
 > 
 > btw doing this is a problem if the user decides to hot(un)plug cpu 0...
 > he then can't move the irqs away to do that
 
You can't hot unplug cpu0.
Take a look in sysfs, no /sys/devices/system/cpu/cpu0/online file.

	Dave

-- 
http://www.codemonkey.org.uk
-

From: Paul E. McKenney
Date: Monday, October 1, 2007 - 1:11 pm

Last I tried it, x86_64 and i386 in fact prohibit hotunplugging CPU 0.

						Thanx, Paul
-

From: Thomas Gleixner
Date: Monday, October 1, 2007 - 2:17 pm

IRQ_NOBALANCING is not preventing cpu unplug. It moves the affinity to the 
next CPU, but the check in NMI watchdog for CPU == 0 would not longer 
work.

Fix below. Post .23 material. I work out a separate one for the x8664 
clock events series.

	tglx

------------>

[PATCH] i386: Fix nmi watchdog per cpu timer irq accounting

The clock events patches changed the interrupt distribution and the
local apic timer interrupt accounting for the broadcast case. The per
cpu clock events handler of the cpu, which runs the broadcast
interrupt, is executed directly in the broadcast irq context. This
does not invoke the low level arch code, which does the local apic
timer irq accounting. The work around for false positives in the nmi
watchdog was to add the irq0 interrupts (broadcast device) to the
local apic timer interrupts. This falsifies the results for the CPUs
which are not handling the broadcast interrupt, i.e. stuck CPUs might
be not detected, as noticed by Andi Kleen.

It would be possible to move the clockevents handler invocation of the
CPU which runs the broadcast interrupt into the tick device broadcast
function, but this would require to handle the per cpu device to this
function and perform the direct operation in the clock device specific
architecture code. Right now this is only i386 and x86_64, but MIPS is
on the way to use the broadcast mode as well.

Introduce a weak function tick_broadcast_account(), which allows x86
to adjust the local apic timer interrupt counter in the case when the
cpu local timer handler has been invoked. This keeps the cpu local
handler decision and invocation in the common code and allows x86 to
handle the nmi watchdog accounting correctly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 3d67ae1..180dde8 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -283,6 +283,16 @@ static void lapic_timer_broadcast(cpumask_t mask)
 }
 
 /*
+ * ...
From: Andi Kleen
Date: Monday, October 1, 2007 - 2:41 pm

That cannot happen right now because cpu_disable() on both i386/x86-64
reject CPU #0. So just setting IRQ_NOBALANCING is sufficient and both

That would not handle the case with a single CPU running only
irq  0 but not broadcasting I think.

I believe ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/fix-watchdog
is the correct fix

-Andi
-

From: Thomas Gleixner
Date: Monday, October 1, 2007 - 2:58 pm

Hmm. The only situation where this can happen is when you add
"nolapic_timer" to the command line on a single CPU system. We do not

Yup, I completely missed the fact, that we reject CPU#0 unplugging, so
your fix seems indeed to be more correct and simpler.

OTOH, the accounting hook would allow us to remove the IRQ#0 -> CPU#0
restriction. Not sure whether it's worth the trouble.

     tglx
-

From: Andi Kleen
Date: Monday, October 1, 2007 - 3:07 pm

Some SIS chipsets hang the machine when you migrate irq 0 to another
CPU. It's better to keep that Also I wouldn't be surprised if there are some
other assumptions about this elsewhere.

Ok in theory it could be done only on SIS, but that probably would really
not be worth the trouble

-Andi
-

From: Thomas Gleixner
Date: Monday, October 1, 2007 - 3:47 pm

Agreed.

I just got a x8664-hrt report, where I found the following oddity:

 0:       1197     172881   IO-APIC-edge      timer

That's one of those infamous AMD C1E boxen. Strange, all my systems have 
IRQ#0 on CPU#0 and nowhere else. Any idea ?

	tglx

-

From: Arjan van de Ven
Date: Monday, October 1, 2007 - 3:52 pm

On Tue, 2 Oct 2007 00:47:12 +0200 (CEST)

2 things;
the current irq balancers don't balance the timer interrupt, in fact
they'll leave it alone (so the chipset might balance)
older ones pin it to cpu 0 or rotate it....
-

From: Mika Penttilä
Date: Monday, October 1, 2007 - 9:56 pm

Here I have with stock FC7 (2.6.22.9-91) kernel :
 0:     107835  133459760   IO-APIC-edge      timer

Processor:
vendor_id    : AuthenticAMD
cpu family    : 15
model        : 107
model name    : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
stepping    : 1
cpu MHz        : 2109.721
cache size    : 512 KB

MB:
Asus M2N-E (NF570)

--Mika


-

From: Arjan van de Ven
Date: Monday, October 1, 2007 - 10:00 pm

On Tue, 02 Oct 2007 07:56:24 +0300


fwiw this is entirely done by the hardware; no irq balancer has touched
this one (fc7 has the new irqbalancer which leaves irq0 alone)
-

From: Andi Kleen
Date: Monday, October 1, 2007 - 10:51 pm

Hmm, in lowestpriority mode it would be possible that the APIC changes
the CPU to #1 once; but IRQ 0 is always set to fixed mode. Also even 
if that happens you should have them all on 1.

Maybe the chipset is just ignoring the IO-APIC configuration in this case?

Is it always the same chipset? Is it seen on i386 too?

The problem is really that if this happens it's more than the NMI watchdog
that is broken. If you don't run an additional APIC timer interrupt on CPU #0
it's possible that CPU #0 won't schedule at all.

The only workaround for chipsets ignoring IRQ affinity would be to keep
track on which CPU irq 0 happens and then restart APIC timer interrupts
on the others (or send IPIs) as needed. But that would be fairly ugly.

IRQ 0 is unfortunately an generally under-validated area because Windows doesn't
use it.

-Andi

-

From: Thomas Gleixner
Date: Monday, October 1, 2007 - 11:18 pm

The clock events code does handle this already. The broadcast interrupt 
can come in on any cpu. It's just the nmi watchdog which would be affected 
by that.

	tglx
-

From: Pallipadi, Venkatesh
Date: Thursday, October 4, 2007 - 9:37 pm

Probably we can workaround this by keeping track of IRQ0 count at percpu
level and
use local apic timer + this percpu counter in NMI. Or just increment
local
apic timer count in IRQ0 with nohz enabled.

Thanks,
Venki
-

From: Thomas Gleixner
Date: Friday, October 5, 2007 - 1:37 pm

No, I tried that. It's ugly.

The per cpu accounting is the correct way to go if we want to take
care of those systems, which ignore the CPU0 binding of irq0.

See patch against the x86 tree below.

	tglx

-------------------->
commit 093976c7ad206a008bd5de4619f40f6bca4a79c3
Author: Thomas Gleixner <tglx@inhelltoy.tec.linutronix.de>
Date:   Fri Oct 5 22:19:18 2007 +0200

    x86: Fix irq0 / local apic timer accounting
    
    The clock events merge introduced a change to the nmi watchdog code to
    handle the not longer increasing local apic timer count in the
    broadcast mode. This is fine for UP, but on SMP it pampers over a
    stuck CPU which is not handling the broadcast interrupt due to the
    unconditional sum up of local apic timer count and irq0 count.
    
    To cover all cases we need to keep track on which CPU irq0 is
    handled. In theory this is CPU#0 due to the explicit disabling of irq
    balancing for irq0, but there are systems which ignore this on the
    hardware level. The per cpu irq0 accounting allows us to remove the
    irq0 to CPU0 binding as well.
    
    Add a per cpu counter for irq0 and evaluate this instead of the global
    irq0 count in the nmi watchdog code.
    
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/kernel/nmi_32.c b/arch/x86/kernel/nmi_32.c
index c7227e2..95d3fc2 100644
--- a/arch/x86/kernel/nmi_32.c
+++ b/arch/x86/kernel/nmi_32.c
@@ -353,7 +353,8 @@ __kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
 	 * Take the local apic timer and PIT/HPET into account. We don't
 	 * know which one is active, when we have highres/dyntick on
 	 */
-	sum = per_cpu(irq_stat, cpu).apic_timer_irqs + kstat_cpu(cpu).irqs[0];
+	sum = per_cpu(irq_stat, cpu).apic_timer_irqs +
+		per_cpu(irq_stat, cpu).irq0_irqs;
 
 	/* if the none of the timers isn't firing, this cpu isn't doing much */
 	if (!touched && last_irq_sums[cpu] == sum) {
diff --git a/arch/x86/kernel/time_32.c ...
Previous thread: Proposed 2.6 Patch for AMD MIPS Alchemy au1550 I2C interface I2C interface by Chris David on Wednesday, September 26, 2007 - 10:48 am. (1 message)

Next thread: forcedeth question by roel on Wednesday, September 26, 2007 - 11:35 am. (1 message)