How about turning off preemption and using a per-CPU buffer?
Alternatively, you could turn off IRQs, poke a per-CPU value to clue
in any incoming NMIs, and switch to a separate stack. I suppose if
you wanted it to work with as little as 16 bytes of stack left on both
the thread and IRQ stacks, you could have separate per-CPU NMI stacks;
the stack dump would then poke a special per-CPU value and send
ourselves an NMI.
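
For illustration only, here is a rough userspace analogue of that last
variant, not kernel code. It assumes POSIX signals: a signal stands in
for the NMI, sigaltstack() for the dedicated NMI stack, and a plain
flag for the per-CPU value. The names dump_requested, dump_handler and
dump_on_alt_stack are made up for this sketch.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALT_STACK_SIZE (64 * 1024)

static char *alt_stack_base;                    /* stands in for the NMI stack */
static volatile sig_atomic_t dump_requested;    /* stands in for the per-CPU value */
static volatile sig_atomic_t ran_on_alt_stack;

/* Plays the role of the NMI handler: only acts if the flag was poked. */
static void dump_handler(int sig)
{
    char marker;
    uintptr_t sp = (uintptr_t)&marker;
    uintptr_t lo = (uintptr_t)alt_stack_base;
    static const char msg[] = "dumping from the alternate stack\n";
    ssize_t unused;

    (void)sig;
    if (!dump_requested)
        return;
    dump_requested = 0;

    /* Record whether this frame really lives on the alternate stack. */
    ran_on_alt_stack = (sp >= lo && sp < lo + ALT_STACK_SIZE);

    /* write() is async-signal-safe; a real dump would walk the old stack. */
    unused = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)unused;
}

/*
 * Returns 1 if the handler ran on the alternate stack, 0 if it ran
 * elsewhere, -1 on setup failure. Call it from main() or any thread
 * that is about to run out of ordinary stack.
 */
int dump_on_alt_stack(void)
{
    stack_t ss;
    struct sigaction sa;

    if (!alt_stack_base) {
        alt_stack_base = malloc(ALT_STACK_SIZE);
        if (!alt_stack_base)
            return -1;
    }

    ss.ss_sp = alt_stack_base;
    ss.ss_size = ALT_STACK_SIZE;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = dump_handler;
    sa.sa_flags = SA_ONSTACK;       /* deliver on the alternate stack */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    ran_on_alt_stack = 0;
    dump_requested = 1;             /* "poke the special value" */
    if (raise(SIGUSR1) != 0)        /* "send ourselves an NMI" */
        return -1;

    return ran_on_alt_stack;
}
```

The point of the flag is the same as in the kernel case: the handler
can tell a requested dump apart from any other delivery of the same
signal, and the caller needs almost no stack of its own to trigger it.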
There are probably a half dozen other variants on ways to run
screaming to the CPU saying "It hurts, mommy!" and get a new stack in
which we can play for a while.
Cheers,
Kyle Moffett