Debugging Multiple CPUs

Submitted by Jeremy
on October 22, 2007 - 2:19pm

"Sysrq-p is pretty useless unless you can force the keyboard interrupt and the spinning process onto the same CPU," noted Chuck Ebbert during a discussion about debugging tasks stuck in a running state. The <Alt><SysRq><p> key combination is a debugging aid that dumps to the console the registers and flags of the CPU that handles the keypress interrupt, so on a multiprocessor system it may not show the CPU on which the stuck task is running. UltraSPARC maintainer David Miller replied, "yes, I find this a painful limitation too," adding:

"Sparc64 used to dump the registers on all active cpus for show_regs() via a cross-call, and this was incredibly useful. But I disabled that as soon as I started playing with Niagara because at 32 cpus and larger the output is just too voluminous to be useful."
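Chuck's point about getting the keyboard interrupt and the spinning process onto the same CPU can be illustrated with a small sketch. It assumes a Linux /proc filesystem in which field 39 of /proc/<pid>/stat holds the CPU a task last ran on, and that /proc/irq/<n>/smp_affinity takes a hexadecimal CPU bitmask. The helper names are hypothetical and are not taken from the thread.

```python
def last_cpu_from_stat(stat_line):
    """Return the CPU a task last ran on, given a /proc/<pid>/stat line."""
    # Field 2 (comm) is parenthesised and may contain spaces, so split
    # after the last ')'. What remains starts at field 3 (state).
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[39 - 3])  # field 39 is "processor"


def last_cpu(pid):
    """Read /proc/<pid>/stat for a live task (Linux only)."""
    with open("/proc/%d/stat" % pid) as f:
        return last_cpu_from_stat(f.read())


def irq_affinity_mask(cpu):
    """Hex bitmask selecting a single CPU, for CPUs 0 through 31 only."""
    if not 0 <= cpu < 32:
        raise ValueError("this sketch only handles CPUs 0-31")
    return format(1 << cpu, "x")
```

With the CPU of the spinning task known, writing the matching mask as root to the smp_affinity file of the keyboard's IRQ (often IRQ 1 on PC hardware, which is an assumption here) may steer the keypress interrupt to that CPU. Whether the hardware honors the setting varies.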

David then suggested, "what might be appropriate is just to get a one-line program counter dump on every cpu via some new sysrq keystroke." Chuck recalled that the -mm kernel had carried similar functionality, but that it had problems: "IIRC -mm had something like this but it was buggy because we were sending IPIs to each processor asking them to print their state. Maybe it would work if we had a way of making them dump their state to a memory location and then collected and printed it from the CPU that's handling the sysrq."
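The approach Chuck describes, in which each CPU stores its state in memory and only the requesting CPU prints, can be modelled in user space. The following is a minimal sketch with threads standing in for CPUs and an event standing in for the IPI. It is not kernel code, and every name and program counter value in it is made up for illustration.

```python
import threading


class CpuStateBoard:
    """One slot per simulated CPU; each CPU writes only its own slot."""

    def __init__(self, ncpus):
        self.slots = [None] * ncpus
        self.remaining = ncpus
        self.lock = threading.Lock()
        self.all_done = threading.Event()

    def record(self, cpu, pc):
        self.slots[cpu] = pc
        with self.lock:
            self.remaining -= 1
            if self.remaining == 0:
                self.all_done.set()


def simulated_cpu(cpu, pc, request, board):
    request.wait()         # stands in for receiving the IPI
    board.record(cpu, pc)  # store state in memory, print nothing


def dump_all_cpus(pcs, timeout=1.0):
    """Ask every simulated CPU for its state, then format one line each."""
    board = CpuStateBoard(len(pcs))
    request = threading.Event()
    threads = [
        threading.Thread(target=simulated_cpu, args=(cpu, pc, request, board))
        for cpu, pc in enumerate(pcs)
    ]
    for t in threads:
        t.start()
    request.set()                  # "send the IPI" to every CPU
    board.all_done.wait(timeout)   # the requester waits, with a bound
    for t in threads:
        t.join()
    # Only the requesting thread formats output, so lines cannot interleave.
    return [
        "CPU%d: pc=%#x" % (cpu, pc) if pc is not None
        else "CPU%d: no response" % cpu
        for cpu, pc in enumerate(board.slots)
    ]
```

Calling dump_all_cpus([0x1000, 0x2000, 0x3000]) returns one line per simulated CPU. Keeping the output to a single line each matches David's suggestion for limiting volume on machines with many CPUs.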


From: Jeff Garzik <jeff@...>
Subject: [2.6.23] tasks stuck in running state?
Date: Oct 19, 5:39 pm 2007

On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
certain behavior at least once a day. I'll start a kernel build (make
-sj5 on this box), and it will "hang" in the following way:

> 31003 ? S 0:04 sshd: jgarzik@pts/0
> 31004 pts/0 Ss 0:02 \_ -bash
> 8280 pts/0 S+ 0:00 \_ make ARCH=i386 -sj4
> 8690 pts/0 Z+ 0:00 \_ [rm]
> 8691 pts/0 S+ 0:00 \_ /bin/sh -c cat include/config/kernel.release 2> /dev/null
> 8692 pts/0 R+ 6:12 \_ cat include/config/kernel.release

Specifically, the symptom is a process, often a simple one like cat(1)
or rm(1) or somewhere in check-headers, will stay in the running state,
accumulating CPU time.

If I Ctrl-C the build, and start over, the build will normally -not- get
stuck at the same point, but proceed to chew through one of a bazillion
allmodconfig builds.

I also see this occasionally on my main workstation (also
2.6.23/x86-64/Fedora-7), though not as frequently.

This is a new behavior since the new scheduler was merged... I think.

Nothing more concrete to report at this time. I cannot easily reproduce
the behavior, as it happens [apparently] randomly sometime during the
day. Generally, the files these programs are dealing with are -always-
in the pagecache, if that makes any difference.

Jeff



From: Chuck Ebbert <cebbert@...>
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 19, 5:53 pm 2007

On 10/19/2007 05:39 PM, Jeff Garzik wrote:
> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
> certain behavior at least once a day. I'll start a kernel build (make
> -sj5 on this box), and it will "hang" in the following way:
>

Can you try to strace the hanging task?



From: Jeff Garzik <jeff@...>
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 19, 6:03 pm 2007

Chuck Ebbert wrote:
> On 10/19/2007 05:39 PM, Jeff Garzik wrote:
>> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
>> certain behavior at least once a day. I'll start a kernel build (make
>> -sj5 on this box), and it will "hang" in the following way:
>>
>
> Can you try to strace the hanging task?

Well, to the system it's running, so that doesn't do much of anything...

>
> 8482 pts/0 S+ 0:00 \_ /bin/sh /garz/repo/misc-2.6/scripts/hdrcheck.sh /garz/repo/misc-2.6/usr/include /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h /garz/repo/misc-2.6/usr/include/linux/.check.kernelcapi.h
> 8484 pts/0 R+ 3:10 \_ grep ^[ \t]*#[ \t]*include[ \t]*< /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h
> 8486 pts/0 S+ 0:00 \_ cut -f2 -d<
> 8487 pts/0 S+ 0:00 \_ cut -f1 -d>
> 8488 pts/0 S+ 0:00 \_ egrep ^linux|^asm
> [jgarzik@pretzel misc-2.6]$ strace -p8484
> Process 8484 attached - interrupt to quit
[sits there, chewing up CPU grepping a 47-line header file]


From: Chuck Ebbert <cebbert@...>
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 19, 6:18 pm 2007

On 10/19/2007 06:03 PM, Jeff Garzik wrote:
>> [jgarzik@pretzel misc-2.6]$ strace -p8484
>> Process 8484 attached - interrupt to quit
> [sits there, chewing up CPU grepping a 47-line header file]
>

And sysrq-p is pretty useless unless you can force the keyboard
interrupt and the spinning process onto the same CPU.



From: David Miller <davem@...>
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 19, 8:01 pm 2007

From: Chuck Ebbert
Date: Fri, 19 Oct 2007 18:18:08 -0400

> On 10/19/2007 06:03 PM, Jeff Garzik wrote:
> >> [jgarzik@pretzel misc-2.6]$ strace -p8484
> >> Process 8484 attached - interrupt to quit
> > [sits there, chewing up CPU grepping a 47-line header file]
> >
>
> And sysrq-p is pretty useless unless you can force the keyboard
> interrupt and the spinning process onto the same CPU.

Yes, I find this a painful limitation too.

Sparc64 used to dump the registers on all active cpus for show_regs()
via a cross-call, and this was incredibly useful. But I disabled that
as soon as I started playing with Niagara because at 32 cpus and
larger the output is just too voluminous to be useful.

What might be appropriate is just to get a one-line program counter
dump on every cpu via some new sysrq keystroke.


From: Chuck Ebbert <cebbert@...>
Subject: Re: [2.6.23] tasks stuck in running state?
Date: Oct 21, 11:59 am 2007

On 10/19/2007 08:01 PM, David Miller wrote:
>
> What might be appropriate is just to get a one-line program counter
> dump on every cpu via some new sysrq keystroke.
>

IIRC -mm had something like this but it was buggy because we were
sending IPIs to each processor asking them to print their state.
Maybe it would work if we had a way of making them dump their
state to a memory location and then collected and printed it from
the CPU that's handling the sysrq.