login
Header Space

 
 

Suspend and Resume with ACPI

September 23, 2007 - 9:35pm
Submitted by Jeremy on September 23, 2007 - 9:35pm.
Linux news

"It took me quite a while to realize the real root cause of the VAIO - and probably many other machines - suspend/resume regressions, which were unearthed by the dyntick / clockevents patches," Thomas Gleixner explained regarding two patches for fixing suspend issues that Andrew Morton experienced with his VAIO laptop. He continued, "we disable a lot of ACPI/BIOS functionality during suspend, but we keep the lower idle C-states functionality active across suspend/resume. It seems that this causes trouble with certain BIOSes, but I assume that the problem is more wide spread and just not surfacing due to the various scenarios in which a machine goes into suspend/resume." Thomas concluded, "I really hope that this two patches finally set an end to the 'jinxed VAIO heisenbug series', which started when we removed the periodic tick with the clockevents/dyntick patches."

Linus Torvalds expressed some concerns, "the patches look fine, but I somehow have this slight feeling that you gave up a bit too soon on the '*why* does this happen?' question." He agreed that at that point there was a problem with ACPI, but cautioned that this could be triggered by another bug, "in particular, I also suspect that this may not really fix the problem - maybe it just makes the window sufficiently small that it no longer triggers. Because we don't necessarily understand what the real background for the problem is, I'm not sure we can say that it is solved." Linus concluded, "but hey, I think I'll apply the patches as-is. I'd just feel even better if we actually understood *why* doing the CPU Cx states is not something we can do around the suspend code!"


From: Thomas Gleixner <tglx@...>
Subject: [patch 0/2] suspend/resume regression fixes
Date: Sep 22, 6:29 pm 2007

Sorry, it took me quite a while to realize the real root cause of the
VAIO - and probably many other machines - suspend/resume regressions,
which were unearthed by the dyntick / clockevents patches.

We disable a lot of ACPI/BIOS functionality during suspend, but we
keep the lower idle C-states functionality active across
suspend/resume. It seems that this causes trouble with certain BIOSes,
but I assume that the problem is more wide spread and just not
surfacing due to the various scenarios in which a machine goes into
suspend/resume. I spent some quality time to figure out a set of debug
mechanisms, which did not influence the problem. So it is quite likely
that a lot of machines might be affected by this, but due to the
configuration, interrupt scenarios, .... the problem just does
not show up. 

My final enlightment was, when I removed the ACPI processor module,
which controls the lower idle C-states, right before resume; this
worked fine all the time even without all the workaround hacks.

I really hope that this two patches finally set an end to the "jinxed
VAIO heisenbug series", which started when we removed the periodic
tick with the clockevents/dyntick patches.

Venki, can you please add the analogous fix to the cpuidle patch set ?

Thanks,

	tglx
-- 

-

From: Thomas Gleixner <tglx@...> Subject: [patch 1/2] ACPI: disable lower idle C-states across suspend/resume Date: Sep 22, 6:29 pm 2007 device_suspend() calls ACPI suspend functions, which seems to have undesired side effects on lower idle C-states. It took me some time to realize that especially the VAIO BIOSes (both Andrews jinxed UP and my elfstruck SMP one) show this effect. I'm quite sure that other bug reports against suspend/resume about turning the system into a brick have the same root cause. After fishing in the dark for quite some time, I realized that removing the ACPI processor module before suspend (this removes the lower C-state functionality) made the problem disappear. Interestingly enough the propability of having a bricked box is influenced by various factors (interrupts, size of the ram image, ...). Even adding a bunch of printks in the wrong places made the problem go away. The previous periodic tick implementation simply pampered over the problem, which explains why the dyntick / clockevents changes made this more prominent. We avoid complex functionality during the boot process and we have to do the same during suspend/resume. It is a similar scenario and equaly fragile. Add suspend / resume functions to the ACPI processor code and disable the lower idle C-states across suspend/resume. Fall back to the default idle implementation (halt) instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Len Brown <lenb@kernel.org> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> --- drivers/acpi/processor_core.c | 2 ++ drivers/acpi/processor_idle.c | 19 ++++++++++++++++++- include/acpi/processor.h | 2 ++ 3 files changed, 22 insertions(+), 1 deletion(-) Index: linux-2.6/drivers/acpi/processor_core.c =================================================================== --- linux-2.6.orig/drivers/acpi/processor_core.c 2007-09-23 00:01:00.000000000 +0200 +++ linux-2.6/drivers/acpi/processor_core.c 2007-09-23 00:01:00.000000000 +0200 @@ -102,6 +102,8 @@ static struct acpi_driver acpi_processor .add = acpi_processor_add, .remove = acpi_processor_remove, .start = acpi_processor_start, + .suspend = acpi_processor_suspend, + .resume = acpi_processor_resume, }, }; Index: linux-2.6/drivers/acpi/processor_idle.c =================================================================== --- linux-2.6.orig/drivers/acpi/processor_idle.c 2007-09-23 00:01:00.000000000 +0200 +++ linux-2.6/drivers/acpi/processor_idle.c 2007-09-23 00:01:00.000000000 +0200 @@ -325,6 +325,23 @@ static void acpi_state_timer_broadcast(s #endif +/* + * Suspend / resume control + */ +static int acpi_idle_suspend; + +int acpi_processor_suspend(struct acpi_device * device, pm_message_t state) +{ + acpi_idle_suspend = 1; + return 0; +} + +int acpi_processor_resume(struct acpi_device * device) +{ + acpi_idle_suspend = 0; + return 0; +} + static void acpi_processor_idle(void) { struct acpi_processor *pr = NULL; @@ -355,7 +372,7 @@ static void acpi_processor_idle(void) } cx = pr->power.state; - if (!cx) { + if (!cx || acpi_idle_suspend) { if (pm_idle_save) pm_idle_save(); else Index: linux-2.6/include/acpi/processor.h =================================================================== --- linux-2.6.orig/include/acpi/processor.h 2007-09-23 00:01:00.000000000 +0200 +++ linux-2.6/include/acpi/processor.h 2007-09-23 00:01:00.000000000 +0200 @@ -320,6 +320,8 @@ int acpi_processor_power_init(struct acp int acpi_processor_cst_has_changed(struct acpi_processor *pr); int acpi_processor_power_exit(struct acpi_processor *pr, struct acpi_device *device); +int acpi_processor_suspend(struct acpi_device * device, pm_message_t state); +int acpi_processor_resume(struct acpi_device * device); /* in processor_thermal.c */ int acpi_processor_get_limit_info(struct acpi_processor *pr); -- -
From: Thomas Gleixner <tglx@...> Subject: [patch 2/2] clockevents: remove the suspend/resume workaround^Wthinko Date: Sep 22, 6:29 pm 2007 In a desparate attempt to fix the suspend/resume problem on Andrews VAIO I added a workaround which enforced the broadcast of the oneshot timer on resume. This was actually resolving the problem on the VAIO but was just a stupid workaround, which was not tackling the root cause: the assignement of lower idle C-States in the ACPI processor_idle code. The cpuidle patches, which utilize the dynamic tick feature and go faster into deeper C-states exposed the problem again. The correct solution is the previous patch, which prevents lower C-states across the suspend/resume. Remove the enforcement code, including the conditional broadcast timer arming, which helped to pamper over the real problem for quite a time. The oneshot broadcast flag for the cpu, which runs the resume code can never be set at the time when this code is executed. It only gets set, when the CPU is entering a lower idle C-State. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Len Brown <lenb@kernel.org> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> --- kernel/time/tick-broadcast.c | 17 +---------------- 1 file changed, 1 insertion(+), 16 deletions(-) Index: linux-2.6/kernel/time/tick-broadcast.c =================================================================== --- linux-2.6.orig/kernel/time/tick-broadcast.c 2007-09-23 00:00:59.000000000 +0200 +++ linux-2.6/kernel/time/tick-broadcast.c 2007-09-23 00:01:00.000000000 +0200 @@ -382,23 +382,8 @@ static int tick_broadcast_set_event(ktim int tick_resume_broadcast_oneshot(struct clock_event_device *bc) { - int cpu = smp_processor_id(); - - /* - * If the CPU is marked for broadcast, enforce oneshot - * broadcast mode. The jinxed VAIO does not resume otherwise. - * No idea why it ends up in a lower C State during resume - * without notifying the clock events layer. - */ - if (cpu_isset(cpu, tick_broadcast_mask)) - cpu_set(cpu, tick_broadcast_oneshot_mask); - clockevents_set_mode(bc, CLOCK_EVT_MODE_ONESHOT); - - if(!cpus_empty(tick_broadcast_oneshot_mask)) - tick_broadcast_set_event(ktime_get(), 1); - - return cpu_isset(cpu, tick_broadcast_oneshot_mask); + return 0; } /* -- -
From: Linus Torvalds <torvalds@...> Subject: Re: [patch 0/2] suspend/resume regression fixes Date: Sep 22, 6:59 pm 2007 On Sat, 22 Sep 2007, Thomas Gleixner wrote: > > My final enlightment was, when I removed the ACPI processor module, > which controls the lower idle C-states, right before resume; this > worked fine all the time even without all the workaround hacks. > > I really hope that this two patches finally set an end to the "jinxed > VAIO heisenbug series", which started when we removed the periodic > tick with the clockevents/dyntick patches. Ok, so the patches look fine, but I somehow have this slight feeling that you gave up a bit too soon on the "*why* does this happen?" question. I realize that the answer is easily "because ACPI screwed up", but I'm wondering if there's something we do to trigger that screw-up. In particular, I also suspect that this may not really fix the problem - maybe it just makes the window sufficiently small that it no longer triggers. Because we don't necessarily understand what the real background for the problem is, I'm not sure we can say that it is solved. The reason I say this is that I have a suspicion on what triggers it. I suspect that the problem is that we do pm_ops->prepare(); disable_nonboot_cpus() suspend_enter(); enable_nonboot_cpus() pm_finish() and here the big thing to notice is that "pm_ops->prepare()" call, which sets the wakup vector etc etc. So maybe the real problem here is that once we've done the "->prepare()" call and ACPI has set up various stuff, we MUST NOT do any calls to any ACPI routines to set low-power states, because the stupid firmware isn't expecting it. Now, if this is the cause, then I think your patch should indeed fix it, since you get called by the early-suspend code (which happens *before* the "->prepare()" call), but at the same time, I wonder if maybe it would be slightly "more correct" to instead of using the suspend/resume callbacks, simply do this in the "acpi_pm_prepare()" stage, since that is likely the thing that triggers it? But hey, I think I'll apply the patches as-is. I'd just feel even better if we actually understood *why* doing the CPU Cx states is not something we can do around the suspend code! Linus -


Linus is wise

September 24, 2007 - 5:54am
Aspid (not verified)

Merely getting rid of a problem is not enough.

You have to understand why it happen, and how you solved it...

solve it? simple, get rid of

September 24, 2007 - 3:38pm
turn_self_off (not verified)

solve it? simple, get rid of acpi...

a bigger mess that that is hard to find, even if efi looks like a good contender...

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
speck-geostationary