Yeah.
Frederic (re-)discovered this problem via very hard-to-debug crashes when he
extended perf call-graph tracing to have a bit larger buffer and used
percpu_alloc() for it (which is entirely reasonable in itself).
Ok. We can solve it by allocating the space from the non-vmalloc percpu area -
8K per CPU.
I think at this point [NMI re-entry] we've corrupted the top of the NMI kernel
stack already, due to entering via the IST stack mechanism, which is
non-nesting and which enters at the same point - right?
We could solve that by copying that small stack frame off before entering the
'generic' NMI routine - but it all feels a bit pulled in by the hair.
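To illustrate the frame overwrite and the copy-off idea, here is a rough
userspace model, not kernel code. The struct, the function names and the
addresses are all made up for illustration, and it ignores the window between
hardware entry and the copy, which a real implementation would still have to
deal with:

```c
#include <stdint.h>

/* Hypothetical model of the frame the CPU pushes on NMI entry. */
struct nmi_frame {
	uint64_t rip;
	uint64_t rsp;
};

/* An IST stack always enters at the same point, so there is one slot. */
static struct nmi_frame ist_top;

/* Model of hardware entry: every NMI overwrites the same slot. */
void ist_enter(uint64_t rip, uint64_t rsp)
{
	ist_top.rip = rip;
	ist_top.rsp = rsp;
}

/* No copy: the outer handler returns through whatever ist_top holds. */
uint64_t return_rip_without_copy(void)
{
	ist_enter(0x1000, 0x7000);	/* first NMI, interrupted code at 0x1000 */
	ist_enter(0x2000, 0x6ff0);	/* re-entry while still in the handler */
	return ist_top.rip;		/* outer frame is gone */
}

/* Copy the small frame off before entering the 'generic' routine. */
uint64_t return_rip_with_copy(void)
{
	struct nmi_frame saved;

	ist_enter(0x1000, 0x7000);
	saved = ist_top;		/* private copy of the outer frame */
	ist_enter(0x2000, 0x6ff0);	/* re-entry clobbers ist_top only */
	return saved.rip;		/* outer return address survives */
}
```

In this model the uncopied path returns to the nested entry's address, while
the copied path still returns to the original one.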
I feel uneasy about taking pagefaults from the NMI handler. Even if we
implemented it all correctly, who knows what CPU errata are waiting there to
be discovered, etc ...
I think we should try to muddle through by preventing these situations from
happening (and adding a WARN_ONCE() to the vmalloc page-fault handler would
certainly help as well), and only go to more clever schemes if no other option
looks sane anymore?
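The warn-once check could look something like the sketch below. It is a
userspace model only: in_nmi_ctx stands in for the kernel's in_nmi(),
warn_once() for WARN_ONCE(), and vmalloc_fault_sketch() is a hypothetical name,
not the real fault handler:

```c
#include <stdbool.h>
#include <stdio.h>

static bool in_nmi_ctx;		/* stand-in for in_nmi() */
static int warnings_printed;	/* counter, only here for the demo */

/* Print the message the first time cond is true, then stay quiet. */
static bool warn_once(bool cond, bool *warned, const char *msg)
{
	if (cond && !*warned) {
		*warned = true;
		warnings_printed++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

/* Hypothetical vmalloc fault path with the check at the top. */
int vmalloc_fault_sketch(void)
{
	static bool warned;

	warn_once(in_nmi_ctx, &warned, "vmalloc fault taken in NMI context");

	/* ... the normal page table sync would go here ... */
	return 0;
}
```

Taking the fault twice in NMI context prints the warning once, which is enough
to point at the offending allocation without flooding the log.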
Thanks,
Ingo