Re: [PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

Previous thread: Bug: 2.6.24-smp: Eeek! page_mapcount(page) went negative! (-1) by kerndev on Wednesday, February 6, 2008 - 11:13 am. (5 messages)

Next thread: MM kernels 2.6.24-rc*-mm*, 2.6.24-mm1: gnome-terminal stuck in tty_poll by Zan Lynx on Wednesday, February 6, 2008 - 12:38 pm. (9 messages)
From: Neil Horman
Date: Wednesday, February 6, 2008 - 12:25 pm

Hey all-
	A hang on kdump was reported to me awhile back, only when systems died
via nmi watchdog panic.  The hang wouldn't always be in the same place, but it
would usually be somewhere down in purgatory.  In looking at the code, it
occured to me that since, during an nmi interrupt, we won't be able to handle
additional interrupts, that we won't be able to halt the other processors on a
system like we try to do in machine_crash_shutdown.  As such, it appears that
leaving the other cpus running exposes us to the risk that another processor
will encounter an error and halt the system while we are trying to boot the
kdump kernel, and that can result in a hang.  I wrote the attached patch to end
the nmi interrupt prior to calling crash_kexec from within die_nmi, and testing
here has proven successfull.

Regards
Neil

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>


 traps_32.c |    2 ++
 traps_64.c |    5 ++++-
 2 files changed, 6 insertions(+), 1 deletion(-)


diff --git a/arch/x86/kernel/traps_32.c b/arch/x86/kernel/traps_32.c
index 3cf7297..4443e8e 100644
--- a/arch/x86/kernel/traps_32.c
+++ b/arch/x86/kernel/traps_32.c
@@ -751,6 +751,8 @@ void __kprobes die_nmi(struct pt_regs *regs, const char *msg)
 	 * and might aswell get out now while we still can.
 	*/
 	if (!user_mode_vm(regs)) {
+		nmi_exit();
+		local_irq_enable();
 		current->thread.trap_no = 2;
 		crash_kexec(regs);
 	}
diff --git a/arch/x86/kernel/traps_64.c b/arch/x86/kernel/traps_64.c
index efc66df..0a50e70 100644
--- a/arch/x86/kernel/traps_64.c
+++ b/arch/x86/kernel/traps_64.c
@@ -608,8 +608,11 @@ void __kprobes die_nmi(char *str, struct pt_regs *regs, int do_panic)
 	 */
 	printk(str, smp_processor_id());
 	show_registers(regs);
-	if (kexec_should_crash(current))
+	if (kexec_should_crash(current)) {
+		nmi_exit();
+		local_irq_enable();
 		crash_kexec(regs);
+	}
 	if (do_panic || panic_on_oops)
 		panic("Non maskable interrupt");
 	oops_end(flags, NULL, SIGBUS);
-- ...
From: Vivek Goyal
Date: Wednesday, February 6, 2008 - 12:40 pm

Hi Neil,

Why wouldn't I be able to stop other cpus if I am inside an NMI handler? I
just need to send an NMI IPI to other cpus and they should be able to
receive and handle it?

Thanks
Vivek
--

From: Neil Horman
Date: Wednesday, February 6, 2008 - 1:12 pm

Can an APIC accept an NMI while already handling an NMI?  I didn't think they
would interrupt one another, but rather, pend until such time as the previous
NMI was cleared

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
--

From: H. Peter Anvin
Date: Wednesday, February 6, 2008 - 1:21 pm

The CPU certainly won't (there is a hidden flag that's cleared on IRET 
which disables NMI; it's possible to re-enable NMI by executing a dummy 
IRET inside the NMI handler.)

	-hpa
--

From: Neil Horman
Date: Wednesday, February 6, 2008 - 2:04 pm

So that would be another solution I would expect.  Would you rather I
investigate that, or is the proposed solution more reasonable?

Neil


-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
--

From: Vivek Goyal
Date: Wednesday, February 6, 2008 - 1:35 pm

Hi Neil,

CPU will not accept another NMI if it is already handling the NMI but in
this case only crashing CPU is in NMI. Other cpus are not and they will
accept the NMI (When crashing cpu sends NMI IPI for stopping these).

So I am not able to understand how clearing the NMI on crashing cpu is
helpful?

Thanks
Vivek
--

From: Ingo Molnar
Date: Wednesday, February 6, 2008 - 3:00 pm

looks good to me, but please move the local_irq_enable() to within 
crash_kexec() instead - probably inside the "got the kexec lock" 
section. That makes crash_kexec() use generally safer too i guess: right 
it seems that die() too can call crash_kexec() with irqs disabled - and 
can thus hang in smp_send_stop() [or wherever it hung before].

	Ingo
--

From: Vivek Goyal
Date: Wednesday, February 6, 2008 - 3:48 pm

In general, I think we should not be servicing interrupts once the system
has crashed and crash_kexec() has been invoked. 

In fact, right now machine_crash_shutdown() explicity disables interrupt
before sending NMIs to other cpus to stop these cpus and which makes sense to
me.

I am wondering if interrupts are disabled on crashing cpu or if crashing
cpu is inside die_nmi(), how would it stop/prevent delivery of NMI IPI to
other cpus.

Am I missing something obivious?

Thanks
Vivek
--

From: Ingo Molnar
Date: Wednesday, February 6, 2008 - 3:53 pm

i wondered about that too. kexec should be as atomic as it can be - 
enabling interrupts only opens up a window for another crash (more 
memory corruption, etc. etc) to happen.

	Ingo
--

From: H. Peter Anvin
Date: Wednesday, February 6, 2008 - 3:56 pm

I don't see how it would.

	-hpa
--

From: Ingo Molnar
Date: Wednesday, February 6, 2008 - 4:36 pm

cross-CPU IPIs are a bit fragile on some PC platforms. So if the kexec 
code relies on getting IPIs to all other CPUs, it might not be able to 
do it reliably. There might be limitations on how many APIC irqs there 
can be queued at a time, and if those slots are used up and the CPU is 
not servicing irqs then stuff gets retried. This might even affect NMIs 
sent via APIC messages - not sure about that.

	Ingo
--

From: Vivek Goyal
Date: Wednesday, February 6, 2008 - 4:50 pm

- Kexec code does not wait infinitely for destination cpu to respond to
  NMI. If destination cpu does not reposond in certain amount of time,
  execution continues. So even if NMI was not delivered to destination
  cpu kexec code should have continued. (Dangerous though, as we don't
  know what other cpu will be doing in the mean time.)

- Even if there is a limitation on how many interrupts can be queued up
  (including NMI), I am not sure how this patch will help that situation.
  This patch is not doing anything on destination cpu (assuming destination
  cpu is also not executing die_nmi() at the same time)

  In fact, even if other cpus are servicing die_nmi() they will end up
  spinning on kexec_lock inside crash_kexec().

I think there is more to this problem then just EOI stuff.

Thanks
Vivek
--

From: Eric W. Biederman
Date: Wednesday, February 6, 2008 - 5:31 pm

The design was as follows:
- Doing anything in the crashing kernel is unreliable.
- We do not have the information to do anything useful in the recovery/target
  kernel.
- Having the other cpus stopped is very nice as it reduces the amount of
  weirdness happening.  We do not share the same text or data addresses
  so stopping the other cpus is not mandatory.  On some other architectures
  there are cpu tables that must live at a fixed address but this is not
  the case on x86.
- Having the location the other cpus were running at is potentially very
  interesting debugging information.

Therefore the intent of the code is to send an NMI to each other cpu.  With
a timeout of a second or so.  So that if the NMI do not get sent we continue
on.

There is certainly still room for improving the robustness by not shutting
down the ioapics and using less general infrastructure code on that path.
That said I would be a little surprised if that is what is biting us.

Looking at the patch the local_irq_enable() is totally bogus.  As soon
was we hit machine_crash_shutdown the first thing we do is disable irqs.

I'm wondering if someone was using the switch cpus on crash patch that was
floating around.  That would require the ipis to work.

I don't know if nmi_exit makes sense.  There are enough layers of abstraction
in that piece of code I can't quickly spot the part that is banging the hardware.

The location of nmi_exit in the patch is clearly wrong.  crash_kexec is a noop
if we don't have a crash kernel loaded (and if we are not the first cpu into it),
so if we don't execute the crash code something weird may happen.  Further the
code is just more maintainable if that kind of code lives in machine_crash_shutdown.



Eric
--

From: Ingo Molnar
Date: Wednesday, February 6, 2008 - 5:39 pm

nmi_exit() has no hw effects - it's just our own bookeeping.

the hw knows that we finished the NMI when we do an iret. Perhaps that's 
the bug or side-effect that made the difference: via enabling irqs we 
get an irq entry, and that does an iret and clears the NMI nested state 
- allowing the kexec context to proceed? I suspect kexec() will do an 
iret eventually (at minimum in the booted up kernel's context) - all 
NMIs are blocked up to that point and maybe the APIC doesnt really like 
being frobbed in that state? In any case, the local_irq_enable() is just 
wrong - it's the worst thing a crashing kernel can do. Perhaps doing an 
intentional iret with a prepared stack-let that just restores to 
still-irqs-off state and jumps to the next instruction could 'exit' the 
NMI context without really having to exit it in the kernel code flow?

	Ingo
--

From: Eric W. Biederman
Date: Wednesday, February 6, 2008 - 6:30 pm

Well I think this is slightly on the wrong track.  The original patch description
said  we get as far as purgatory.  Purgatory is the bit of C code form
/sbin/kexec that runs just before our second kernel.  It sets up
arguments and verify the target kernel has a valid sha256sum.  If
purgatory detects data corruption it spins, to prevent a corrupt
recovery kernel from doing something nasty.

It appears that the primary crash_kexec path is working fine.

The original description speculated that we had non-stopped cpus that
were telling the hardware to shut off.

I don't see what the hang is.  However the goal apparently is to make
the kexec on panic path more robust so that we can take crash dumps in
more strange cases.

We can get NMI from the nmi watchdogs so it is possible this happens
on legitimate hardware so there is a chance this is deterministic and
that we can get enough information to debug and fix the original
error.

If part of the problem is getting to crash_kexec my inclination is to
move the call to crash_kexec up as early as possible in die_nmi.  As
we may simply be hanging in printk or something stupid like that.

It is weird that only the 32bit die_nmi path calls bust_spinlocks.

I'm not really happy with the secondary cpus taking whole notify_die
path as that is more general purpose infrastructure that might go
bad.  However it doesn't appear broken, and it should not be critical
to the crash dump process.

Eric
--

From: Neil Horman
Date: Thursday, February 7, 2008 - 5:17 am

Ingo noted a few posts down the nmi_exit doesn't actually write to the APIC EOI
register, so yeah, I agree, its bogus (and I apologize, I should have checked
that more carefully).  Nevertheless, this patch consistently allowed a hangning
machine to boot through an Nmi lockup.  So I'm forced to wonder whats going on
then that this patch helps with.  perhaps its a just a very fragile timing
Definately not the case, I did a clean build from a cvs tree to test this and

-- 
/****************************************************
 * Neil Horman <nhorman@tuxdriver.com>
 * Software Engineer, Red Hat
 ****************************************************/
--

From: Ingo Molnar
Date: Thursday, February 7, 2008 - 5:24 am

try a dummy iret, something like:

  asm volatile ("pushf; push $1f; iret; 1: \n");

to get the CPU out of its 'nested NMI' state. (totally untested)

the idea is to push down an iret frame to the kernel stack that will 
just jump to the next instruction and gets it out of the NMI nesting. 
Note: interrupts will/must still be disabled, despite the iret. (the 
ordering of the pushes might be wrong, we might need more than that for 
a valid iret, etc. etc.)

	Ingo
--

From: Neil Horman
Date: Thursday, February 7, 2008 - 1:37 pm

Will do.  I'll report results as soon as I have them,.  Thanks!
Neil

-- 
/****************************************************
 * Neil Horman <nhorman@tuxdriver.com>
 * Software Engineer, Red Hat
 ****************************************************/
--

From: Neil Horman
Date: Friday, February 8, 2008 - 9:14 am

Just tried this experiment and it met with success.  Executing a dummy iret
instruction got us to boot the kdump kernel successfully.  

Thoughts on how we should handle this from here?


Regards
Neil

-- 
/****************************************************
 * Neil Horman <nhorman@tuxdriver.com>
 * Software Engineer, Red Hat
 ****************************************************/
--

From: Vivek Goyal
Date: Friday, February 8, 2008 - 9:45 am

Interesting. So that means there is some operation we can't perform when
we are in NMI handler (Or nested NMIs, I don't know if this is nested NMI
case ).

Even if we initiated crash dump in NMI handler, next kernel should unlock
that state as soon as we enable interrupts in next kernel (iret will be
called).

So the only issue here will be if need to put the explicit logic to unlock
the NMI earlier (Either in crashing kernel after clearing IDT or in
purgatory code). Anything earlier then that, will be dangerous though, handling
another NMI while we are already crashed and doing final preparations to jump
to the new kernel.

Neil, is it possible to do some serial console debugging to find out
where exactly we are hanging? Beats me, what's that operation which can
not be executed while being in NMI handler and makes system to hang. I am
also curious to know if it is nested NMI case.

Thanks
Vivek
--

From: Neil Horman
Date: Friday, February 8, 2008 - 10:26 am

I can try, but my last attempts to do so fuond me hung in various places in
purgatory or very early in head.S.  I'll try again though, to see if I can get
some consistency.


-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
--

From: Neil Horman
Date: Tuesday, February 12, 2008 - 2:08 pm

Hey-
	Some intermediate results:

I've instrumented head.S in the kernel with the following code:
#define SEROUT(z) \     
mov $0x3F8,%dx;\
movb z,%al;\
outb %dx


And peppered different ascii characters throughout the startup code from
startup_32 to right before the jump to start_kernel.  When I panic the system
via an:
echo c > /proc/sysrq_trigger
I see an appropriate sequence of characters on the serial console

When I panic the box by forcing an NMI watchdog timeout however, I see nothing.
 The machine will either hang, or reset into the bios.  I think this is
reasonably conclusive in its indication that we're not getting into the second
kernel when this problem occurs.  Next I'll instrument the purgatory code in a
simmilar way.

Regards

-- 
/****************************************************
 * Neil Horman <nhorman@tuxdriver.com>
 * Software Engineer, Red Hat
 ****************************************************/
--

From: Eric W. Biederman
Date: Friday, February 15, 2008 - 7:02 am

My apologies for the late reply.  You should be able to use the --console-serial
option to kexec to get output from purgatory, which will catch everything
except a little bit of the assembly stub.  purgatory even has a stripped down
version of printf you can call.

Eric
--

From: Neil Horman
Date: Wednesday, February 20, 2008 - 7:57 am

Hey, further update.  I instrumented purgatory, and turned on the console
options that Eric mentioned, which resulted in no output to the console, so I've
concluded that we don't get to purgatory at all.  Further instrumenting shows
that the last C level printk I can issue is right before we call
relocate_kernel.  So it appears that we hang/reset somewhere in there.  I'll
instrument down that path shortly.

Regards
Neil

--

From: Andi Kleen
Date: Friday, February 8, 2008 - 9:54 am

Just if you do this while running on the NMI stack (and I think
you do if you insert it at the same place as Neil's original patch)
and another NMI happens for some reason you will be dead
because your stack will corrupt.

If you really want nested NMIs you would need to fix up the NMI
stacks first by pointing them below your own stack.

-Andi

--

Previous thread: Bug: 2.6.24-smp: Eeek! page_mapcount(page) went negative! (-1) by kerndev on Wednesday, February 6, 2008 - 11:13 am. (5 messages)

Next thread: MM kernels 2.6.24-rc*-mm*, 2.6.24-mm1: gnome-terminal stuck in tty_poll by Zan Lynx on Wednesday, February 6, 2008 - 12:38 pm. (9 messages)