On Wed, Nov 10, 2010 at 2:28 PM, Andrew Lutomirski <andy@luto.us> wrote:
I tracked it down. The interrupt code in 2.6.36 is totally broken ---
it acknowledges the interrupt *in the bottom half*. This might work
by accident if the bottom half gets queued on a different CPU, but
something probably changed (concurrency-managed workqueues?) that makes
the BH end up on the same CPU. So the CPU starves the BH and there
goes a CPU.
Then the clocksource watchdog hits and takes the whole system down
when it calls stop_machine(), which also gets starved on that CPU.
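
To make the starvation concrete, here is a toy model of it. This is not
the driver code, just a sketch. It assumes a level-triggered line that
stays asserted until someone acks it, and every name in it (simulate,
bh_cpu, irq_cpu, max_ticks) is made up for illustration.

```python
def simulate(bh_cpu, irq_cpu=0, max_ticks=1000):
    """Toy model of acking an interrupt in the bottom half.

    Returns the tick at which the bottom half finally acks the
    interrupt, or None if it never gets to run within max_ticks.
    Assumption: the line is level-triggered, so the hard-irq handler
    re-enters on irq_cpu for as long as the line stays asserted.
    """
    line_asserted = True   # device is requesting service
    bh_queued = False

    for tick in range(max_ticks):
        # Top half: runs on irq_cpu whenever the line is asserted.
        # It only queues the bottom half and does NOT ack the device,
        # so the line is still asserted when it returns.
        irq_cpu_busy = line_asserted
        if irq_cpu_busy:
            bh_queued = True

        # Bottom half: can only run on a CPU that is not stuck
        # re-entering the hard-irq handler.
        bh_can_run = bh_queued and (bh_cpu != irq_cpu or not irq_cpu_busy)
        if bh_can_run:
            line_asserted = False  # the ack finally happens here
            return tick

    return None  # bottom half was starved for the whole run


if __name__ == "__main__":
    print("BH on another CPU:", simulate(bh_cpu=1))
    print("BH on the IRQ CPU:", simulate(bh_cpu=0))
```

In this model the BH gets through only when it is queued on a different
CPU from the one taking the interrupt. On the same CPU it never runs,
which is the "works by accident" case breaking.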
Patch coming.
--Andy