Thanks to tglx and ghaskins for all the help in tracking down a very early nmi_watchdog crash on certain x86_64 machines.
The patch is totally bogus. irq 0 doesn't say anything about whether the current CPU still works or not. You always need some local interrupt. This basically disables the NMI watchdog for the non boot CPUs. It's even wrong on i386 -- i wonder how that broken patch made it in there. I'll remove it there. -Andi -
Right, it's wrong for the broadcast case, but simply removing it will
trigger false positives on the CPU which runs the broadcast timer. I
fix this proper.
tglx
-
I already did this here by checking for cpu != 0. But it also needs either tracking or forbidding migrations of irq 0. I can take care of the patch. -Andi -
I was thinking about the same fix. On i386 we already have the irq migration / balancing of irq 0 disabled. That's why we setup IRQ0 with IRQ_NOBALANCING. tglx -
btw doing this is a problem if the user decides to hot(un)plug cpu 0... he then can't move the irqs away to do that -
On Mon, Oct 01, 2007 at 12:56:26PM -0700, Arjan van de Ven wrote: > On Mon, 1 Oct 2007 21:27:39 +0200 (CEST) > > > > I already did this here by checking for cpu != 0. But it also needs > > > either tracking or forbidding migrations of irq 0. I can take care > > > of the patch. > > > > I was thinking about the same fix. On i386 we already have the irq > > migration / balancing of irq 0 disabled. That's why we setup IRQ0 with > > IRQ_NOBALANCING. > > btw doing this is a problem if the user decides to hot(un)plug cpu 0... > he then can't move the irqs away to do that You can't hot unplug cpu0. Take a look in sysfs, no /sys/devices/system/cpu/cpu0/online file. Dave -- http://www.codemonkey.org.uk -
Last I tried it, x86_64 and i386 in fact prohibit hotunplugging CPU 0. Thanx, Paul -
IRQ_NOBALANCING is not preventing cpu unplug. It moves the affinity to the next CPU, but the check in NMI watchdog for CPU == 0 would not longer work. Fix below. Post .23 material. I work out a separate one for the x8664 clock events series. tglx ------------> [PATCH] i386: Fix nmi watchdog per cpu timer irq accounting The clock events patches changed the interrupt distribution and the local apic timer interrupt accounting for the broadcast case. The per cpu clock events handler of the cpu, which runs the broadcast interrupt, is executed directly in the broadcast irq context. This does not invoke the low level arch code, which does the local apic timer irq accounting. The work around for false positives in the nmi watchdog was to add the irq0 interrupts (broadcast device) to the local apic timer interrupts. This falsifies the results for the CPUs which are not handling the broadcast interrupt, i.e. stuck CPUs might be not detected, as noticed by Andi Kleen. It would be possible to move the clockevents handler invocation of the CPU which runs the broadcast interrupt into the tick device broadcast function, but this would require to handle the per cpu device to this function and perform the direct operation in the clock device specific architecture code. Right now this is only i386 and x86_64, but MIPS is on the way to use the broadcast mode as well. Introduce a weak function tick_broadcast_account(), which allows x86 to adjust the local apic timer interrupt counter in the case when the cpu local timer handler has been invoked. This keeps the cpu local handler decision and invocation in the common code and allows x86 to handle the nmi watchdog accounting correctly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c index 3d67ae1..180dde8 100644 --- a/arch/i386/kernel/apic.c +++ b/arch/i386/kernel/apic.c @@ -283,6 +283,16 @@ static void lapic_timer_broadcast(cpumask_t mask) } /* + * ...
That cannot happen right now because cpu_disable() on both i386/x86-64 reject CPU #0. So just setting IRQ_NOBALANCING is sufficient and both That would not handle the case with a single CPU running only irq 0 but not broadcasting I think. I believe ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/fix-watchdog is the correct fix -Andi -
Hmm. The only situation where this can happen is when you add
"nolapic_timer" to the command line on a single CPU system. We do not
Yup, I completely missed the fact, that we reject CPU#0 unplugging, so
your fix seems indeed to be more correct and simpler.
OTOH, the accounting hook would allow us to remove the IRQ#0 -> CPU#0
restriction. Not sure whether it's worth the trouble.
tglx
-
Some SIS chipsets hang the machine when you migrate irq 0 to another CPU. It's better to keep that Also I wouldn't be surprised if there are some other assumptions about this elsewhere. Ok in theory it could be done only on SIS, but that probably would really not be worth the trouble -Andi -
Agreed. I just got a x8664-hrt report, where I found the following oddity: 0: 1197 172881 IO-APIC-edge timer That's one of those infamous AMD C1E boxen. Strange, all my systems have IRQ#0 on CPU#0 and nowhere else. Any idea ? tglx -
On Tue, 2 Oct 2007 00:47:12 +0200 (CEST) 2 things; the current irq balancers don't balance the timer interrupt, in fact they'll leave it alone (so the chipset might balance) older ones pin it to cpu 0 or rotate it.... -
Here I have with stock FC7 (2.6.22.9-91) kernel : 0: 107835 133459760 IO-APIC-edge timer Processor: vendor_id : AuthenticAMD cpu family : 15 model : 107 model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ stepping : 1 cpu MHz : 2109.721 cache size : 512 KB MB: Asus M2N-E (NF570) --Mika -
On Tue, 02 Oct 2007 07:56:24 +0300 fwiw this is entirely done by the hardware; no irq balancer has touched this one (fc7 has the new irqbalancer which leaves irq0 alone) -
Hmm, in lowestpriority mode it would be possible that the APIC changes the CPU to #1 once; but IRQ 0 is always set to fixed mode. Also even if that happens you should have them all on 1. Maybe the chipset is just ignoring the IO-APIC configuration in this case? Is it always the same chipset? Is it seen on i386 too? The problem is really that if this happens it's more than the NMI watchdog that is broken. If you don't run an additional APIC timer interrupt on CPU #0 it's possible that CPU #0 won't schedule at all. The only workaround for chipsets ignoring IRQ affinity would be to keep track on which CPU irq 0 happens and then restart APIC timer interrupts on the others (or send IPIs) as needed. But that would be fairly ugly. IRQ 0 is unfortunately an generally under-validated area because Windows doesn't use it. -Andi -
The clock events code does handle this already. The broadcast interrupt can come in on any cpu. It's just the nmi watchdog which would be affected by that. tglx -
Probably we can workaround this by keeping track of IRQ0 count at percpu level and use local apic timer + this percpu counter in NMI. Or just increment local apic timer count in IRQ0 with nohz enabled. Thanks, Venki -
No, I tried that. It's ugly.
The per cpu accounting is the correct way to go if we want to take
care of those systems, which ignore the CPU0 binding of irq0.
See patch against the x86 tree below.
tglx
-------------------->
commit 093976c7ad206a008bd5de4619f40f6bca4a79c3
Author: Thomas Gleixner <tglx@inhelltoy.tec.linutronix.de>
Date: Fri Oct 5 22:19:18 2007 +0200
x86: Fix irq0 / local apic timer accounting
The clock events merge introduced a change to the nmi watchdog code to
handle the not longer increasing local apic timer count in the
broadcast mode. This is fine for UP, but on SMP it pampers over a
stuck CPU which is not handling the broadcast interrupt due to the
unconditional sum up of local apic timer count and irq0 count.
To cover all cases we need to keep track on which CPU irq0 is
handled. In theory this is CPU#0 due to the explicit disabling of irq
balancing for irq0, but there are systems which ignore this on the
hardware level. The per cpu irq0 accounting allows us to remove the
irq0 to CPU0 binding as well.
Add a per cpu counter for irq0 and evaluate this instead of the global
irq0 count in the nmi watchdog code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
diff --git a/arch/x86/kernel/nmi_32.c b/arch/x86/kernel/nmi_32.c
index c7227e2..95d3fc2 100644
--- a/arch/x86/kernel/nmi_32.c
+++ b/arch/x86/kernel/nmi_32.c
@@ -353,7 +353,8 @@ __kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
* Take the local apic timer and PIT/HPET into account. We don't
* know which one is active, when we have highres/dyntick on
*/
- sum = per_cpu(irq_stat, cpu).apic_timer_irqs + kstat_cpu(cpu).irqs[0];
+ sum = per_cpu(irq_stat, cpu).apic_timer_irqs +
+ per_cpu(irq_stat, cpu).irq0_irqs;
/* if the none of the timers isn't firing, this cpu isn't doing much */
if (!touched && last_irq_sums[cpu] == sum) {
diff --git a/arch/x86/kernel/time_32.c ...