Re: oprofile + hibernation = badness

Previous thread: [git pull] IDE fixes by Bartlomiej Zolnierkiewicz on Monday, August 18, 2008 - 1:22 pm. (1 message)

Next thread: [PATCH 1/2] byteorder: add new headers for make headers-install by Harvey Harrison on Monday, August 18, 2008 - 1:38 pm. (3 messages)
From: Vegard Nossum
Date: Monday, August 18, 2008 - 1:32 pm

Hi,

I'm probably crazy to do it, but...

    $ opcontrol --start
    $ echo disk > /sys/power/state

...leads to lots of badness/strangeness. First, lots of APIC errors:

APIC error on CPU0: 40(40)
[repeat 8 times]
APIC error on CPU1: 40(40)
APIC error on CPU1: 40(40)
APIC error on CPU0: 40(40)

These keep on coming all through the suspend/shutdown sequence, also
intermixing with other messages. I'm guessing oprofile is trying to
NMI CPUs that have been brought down?

Now I get some ACPI Exceptions, but I think that these are unrelated
to starting oprofile, because I have seen them on regular shutdowns as
well:

PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.00 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
PM: Shrinking memory... done (0 pages freed)
PM: Freed 0 kbytes in 0.44 seconds (0.00 MB/s)
Suspending console(s) (use no_console_suspend to debug)
ACPI Exception (exoparg2-0444): AE_AML_PACKAGE_LIMIT, Index
(000000007) is beyond end of object [20080609]
ACPI Error (psparse-0530): Method parse/execution failed
[\_SB_.PCI0.IDE0.GTM_] (Node f783bfc0), AE_AML_PACKAGE_LIMIT
ACPI Error (psparse-0530): Method parse/execution failed
[\_SB_.PCI0.IDE0.CHN0._GTM] (Node f783bb40), AE_AML_PACKAGE_LIMIT
ACPI handle has no context!
serial 00:0d: disabled
serial 00:06: disabled
ehci_hcd 0000:00:1d.7: PCI INT A disabled
uhci_hcd 0000:00:1d.3: PCI INT D disabled
uhci_hcd 0000:00:1d.2: PCI INT C disabled
uhci_hcd 0000:00:1d.1: PCI INT B disabled
uhci_hcd 0000:00:1d.0: PCI INT A disabled
ACPI: Preparing to enter system sleep state S4

After that, I see the message "WQ on CPU0, prefer CPU1" a few times.
This must also be bad.

Now some warnings:

------------[ cut here ]------------
WARNING: at /uio/arkimedes/s29/vegardno/git-working/linux-2.6/kernel/smp.c:328 s
mp_call_function_mask+0x194/0x1a0()
Pid: 3711, comm: bash Not tainted 2.6.27-rc3-00415-g122c9e0 #10
 [<c013584f>] warn_on_slowpath+0x4f/0x80
 ...
From: Rafael J. Wysocki
Date: Monday, August 18, 2008 - 1:51 pm

They CPUs are taken down after suspending devices.  If you boot the
kernel with 'no_console_suspend' in the command line, there will be more
debug information to have a look at.

Also, you can do

# echo core > /sys/power/pm_test

before the '$ echo disk > /sys/power/state' (that only tests the suspend
sequence without actually hibernating) and see what symptoms will be

That last thing is probably the safest to do at the moment.

Apparently nmi_suspend() conflicts with oprofile somehow.  Also, the offlining
of non-boot CPUs may confuse it.  It would be helpful to check if the CPU
hotplug works with oprofile.

Thanks,
Rafael
--

From: Vegard Nossum
Date: Monday, August 18, 2008 - 2:08 pm

That is a good suggestion :-)

Here is offlining:

CPU 1 is now offline
lockdep: fixing up alternatives.
SMP alternatives: switching to UP code
CPU0 attaching NULL sched-domain.
WQ on CPU0, prefer CPU1
CPU1 attaching NULL sched-domain.
CPU0 attaching sched-domain:
 domain 0: span 0 level CPU
  groups: 0
WQ on CPU0, prefer CPU1
WQ on CPU0, prefer CPU1
WQ on CPU0, prefer CPU1
[repeat last message indefinitely]

Here is onlining:

Booting processor 1/1 ip 6000
Initializing CPU#1
WQ on CPU0, prefer CPU1
WQ on CPU0, prefer CPU1
Calibrating delay using timer specific routine.. 5986.15 BogoMIPS (lpj=29930790)
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 2048K
CPU: Physical Processor ID: 0
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel P4/Xeon Extended MCE MSRs (24) available
CPU1: Thermal monitoring enabled
x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
CPU1: Intel(R) Pentium(R) 4 CPU 3.00GHz stepping 05
checking TSC synchronization [CPU#0 -> CPU#1]:
Measured 120 cycles TSC warp between CPUs, turning off TSC clock.
Marking TSC unstable due to check_tsc_sync_source failed
APIC error on CPU1: 00(40)
Clockevents: could not switch to one-shot mode:<7>APIC error on CPU1: 40(40)
 lapic is not functional.
Could not switch to high resolution mode on CPU 0
Clockevents: could not switch to one-shot mode: lapic is not functional.
Could not switch to high resolution mode on CPU 1
APIC error on CPU1: 40(40)
[sched domains messages
WQ on CPU0, prefer CPU1
APIC error on CPU1: 40(40)
[repeat last message 9 times]

Then follows this pattern indefinitely:

WQ on CPU0, prefer CPU1
APIC error on CPU1: 40(40)
[repeat last message 9 times]

That's basically the same thing as I saw with suspend. So it can be
reproduced easily with CPU hotplug.


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as ...
From: Rafael J. Wysocki
Date: Monday, August 18, 2008 - 2:15 pm

Well, I don't know who's the right person to ask about the CPU hotplug.
Andrew, can you help please?

Rafael
--

From: Andrew Morton
Date: Monday, August 18, 2008 - 2:29 pm

On Mon, 18 Aug 2008 23:15:59 +0200

The CPU hotplug maintainer is basically "everyone", because many
subsystems need to interact correctly with hotplugging, and many
subsystems can break it.

This one looks like a clocksource/apic/resume problem?  Probably the
next port of call would be Thomas, with Robert looking on.

--

From: Andi Kleen
Date: Monday, August 18, 2008 - 6:13 pm

That should actually work in theory. Linux CPU offlining puts CPUs in a state where
they can still process NMIs. Hmm actually there was a change recently to free
their exception stacks. Maybe it's broken now.


The usual problem: the suspend function when interrupts are
already disabled calls smp_call_function which is not allowed with
interrupt off. But at this point all the other CPUs should be already
down anyways, so it should be enough to just drop that.

This patch should fix that problem at least by fixing cpu hotplug&
suspend support. Untested.

-Andi

From: Vegard Nossum
Date: Tuesday, August 19, 2008 - 12:12 am

This gets rid of the suspend warnings, so it looks to be a step in the
right direction! FWIW, you may add my Tested-by to the patch.


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Ingo Molnar
Date: Tuesday, August 19, 2008 - 2:49 am

i've queued it up in tip/x86/oprofile - thanks guys.

Andi, another issue: your patch had 5 spurious whitespace errors. This 
is a reoccuring, many years pattern of behavior from you and you need to 
fix your workflow to send less sloppy patches. You should check your 
editor's settings (and generally you should take a good look at patches 
you send out).

I fixed up the problems of this patch, no need to resend.

	Ingo
--

From: Andi Kleen
Date: Tuesday, August 19, 2008 - 5:12 am

I would have thought Robert would take them? He's oprofile

Yes I forgot you define patch cleanness based on white space.

Seriously, It's actually new -- i did this one in git directly instead of
quilt and git-commit doesn't seem to know how to drop them. In my older quilt
workflow they were always automatically dropped since many years. Both emacs

I hope you fixed the typos in the commit log too (it really was just
a RFC test patch)

-Andi
--

From: Robert Richter
Date: Tuesday, August 19, 2008 - 5:37 am

Yes, as soon as I have my own repository available this will change.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com

--

From: Johannes Weiner
Date: Tuesday, August 19, 2008 - 5:56 am

You probably misconfigured emacs then.  I ran checkpatch over your patch
and don't see how and why emacs would add trailing whitespace after a
opening brace.

You can use this to be on the safe side:

(add-hook 'before-save-hook 'delete-trailing-whitespace)

	Hannes
--

From: Andi Kleen
Date: Tuesday, August 19, 2008 - 6:18 am

I don't know how, but it does. vim does too sometimes (I use both

That drops it on all. This means when I edit something which already
has trailing spaces then the diff is full of noise. That would
be a cure far worse than the disease[1]

Ideal would be if git-commit dropped them on all new lines, similar to 
git-apply --whitespace=fix and quilt refresh, but it doesn't seem to know how.

But normally maintainers just set these options (at least I do and Andrew
does too) so it doesn't really matter if the submitted patch has trailing 
spaces or not because they disappear on the way up. 

And that said I refuse to write any more on this silly bike shed topic.

-Andi

[1] to be honest I'm not sure i are actually a disease since I don't
know of any single problem they cause.
--

From: Ingo Molnar
Date: Tuesday, August 19, 2008 - 6:18 am

Thousands of other kernel developers get it right and manage to submit 
clean patches, so i think you are not trying very hard to actually solve 
this problem - and your flippant and obtrusive answers do not make it 

You've put your signoff in there and Vegard has tested it. The general 
principle on lkml is that patches can be marked as do-not-apply-yet by 
adding something like this:

 Not-Signed-off-by: Andi Kleen <ak@linux.intel.com>

or by not adding a SOB line at all. In that case i either ask people for 
the SOB or apply them at my own risk.

Thanks,

	Ingo
--

From: Robert Richter
Date: Monday, September 1, 2008 - 9:34 am

Ingo,

this fix (commit id 1404e403) is not yet upstream for v2.6.27. Could
this be added to the next pull request?

Thanks,

-Robert



-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com

--

From: Ingo Molnar
Date: Friday, September 5, 2008 - 10:58 am

hi Robert,


it's already upstream, as commit 80a8c9fffa. I've zapped the duplicate 
1404e403 from x86/oprofile.

	Ingo
--

From: Robert Richter
Date: Friday, September 5, 2008 - 11:59 am

Ah, I stumbled other this because it is still listed in the shortlog:

 $ git shortlog linux-2.6/master..tip/master | grep oprofile
       x86: fix oprofile + hibernation badness
 ...

But anyway, I found the other commit as well, thanks.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com

--

From: Ingo Molnar
Date: Friday, September 5, 2008 - 1:31 pm

yeah - it's still in the integration branch - but it doesnt hurt as it's 
same-content patch (it got cherry-picked over into an urgent branch). 
The duplication should go away on the next tip/master integration run 
(tomorrow-ish) - please let me know if it doesnt.

	Ingo
--

Previous thread: [git pull] IDE fixes by Bartlomiej Zolnierkiewicz on Monday, August 18, 2008 - 1:22 pm. (1 message)

Next thread: [PATCH 1/2] byteorder: add new headers for make headers-install by Harvey Harrison on Monday, August 18, 2008 - 1:38 pm. (3 messages)