Re: [2.6.25-rc1] System no longer powers off after shutdown

Previous thread: [PATCH][drivers/pnp/pnpacpi/core.c] __initdata is not an identifier by Roel Kluin on Monday, February 11, 2008 - 8:05 am. (1 message)

Next thread: [patch 1/4] mempolicy: convert MPOL constants to enum by David Rientjes on Monday, February 11, 2008 - 8:30 am. (78 messages)
From: Frans Pop
Date: Monday, February 11, 2008 - 8:23 am

In general 2.6.25 is looking quite good on my desktop, but there's one
important issue: the system no longer powers off after shutdown.
This works fine with 2.6.24.

If there are any suggestions about patches to try or commits to revert,
please let me know. If not, I'll run a bisect.

Cheers,
FJP

Base Board Information
        Manufacturer: Intel Corporation
        Product Name: D945GCZ
        Version: AAC99567-502
BIOS Information
        Vendor: Intel Corp.
        Version: NT94510J.86A.4089.2007.0718.0501
        Release Date: 07/18/2007

Processor: Intel(R) Pentium(R) D CPU 3.20GHz

$ lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation 82945G/GZ/P/PL Memory Controller Hub [8086:2770] (rev 02)
00:02.0 VGA compatible controller [0300]: Intel Corporation 82945G/GZ Integrated Graphics Controller [8086:2772] (rev 02)
00:1b.0 Audio device [0403]: Intel Corporation 82801G (ICH7 Family) High Definition Audio Controller [8086:27d8] (rev 01)
00:1c.0 PCI bridge [0604]: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 [8086:27d0] (rev 01)
00:1c.2 PCI bridge [0604]: Intel Corporation 82801G (ICH7 Family) PCI Express Port 3 [8086:27d4] (rev 01)
00:1c.3 PCI bridge [0604]: Intel Corporation 82801G (ICH7 Family) PCI Express Port 4 [8086:27d6] (rev 01)
00:1c.4 PCI bridge [0604]: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 [8086:27e0] (rev 01)
00:1c.5 PCI bridge [0604]: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 [8086:27e2] (rev 01)
00:1d.0 USB Controller [0c03]: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #1 [8086:27c8] (rev 01)
00:1d.1 USB Controller [0c03]: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #2 [8086:27c9] (rev 01)
00:1d.2 USB Controller [0c03]: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #3 [8086:27ca] (rev 01)
00:1d.3 USB Controller [0c03]: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #4 [8086:27cb] (rev 01)
00:1d.7 USB Controller [0c03]: Intel Corporation ...
From: Frans Pop
Date: Tuesday, February 12, 2008 - 1:39 pm

(Resending full details as I can't find my previous mail in the archives.)


Don't ask me why, but bisection shows this commit to be the cause of the
failure to power off:
commit c10997f6575f476ff38442fa18fd4a0d80345f9d
Author: Greg Kroah-Hartman <gregkh@suse.de>
Date:   Thu Dec 20 08:13:05 2007 -0800

    Kobject: convert drivers/* from kobject_unregister() to kobject_put()

Because it seemed somewhat unlikely, I have double checked this by doing an
extra compilation for this commit and its predecessor.

Cheers,

From: Greg KH
Date: Tuesday, February 12, 2008 - 1:56 pm

What is the symptom of not powering off?

Can you press SysRq-T and see a task list running and waiting when
things should be shut down?

Do you happen to have a USB storage stick plugged into the system?

thanks,

greg k-h
--

From: Frans Pop
Date: Tuesday, February 12, 2008 - 2:45 pm

Symptom is that the system shuts down normally and completely, it just does 
not power off. Here are the last messages on the console:
Will now halt.
sd 1:0:0:0: [sdb] Synchronizing SCSI cache
sd 1:0:0:0: [sdb] Stopping disk
sd 0:0:0:0: [sda] Synchronizing SCSI cache
sd 0:0:0:0: [sda] Stopping disk
ACPI: PCI interrupt for device 0000:01:00.0 disabled


Nothing. Only USB kbd/mouse.


Note that I've had this issue before with this box:
http://bugzilla.kernel.org/show_bug.cgi?id=6879

Somehow it disappeared when I pulled the extra video card that came with the 
system (no decent driver for it, so no loss). Since then the system has 
always powered off completely reliably.
This time it is a clear and reproducible regression. If we can solve this 
one we might get a better handle on #6879 too.

Cheers,
FJP
--

From: Andrew Morton
Date: Wednesday, February 13, 2008 - 12:54 am

I've been struggling with an identically-manifesting regression on one of
my test machines for a week.  It's due to softlockup changes, and setting
CONFIG_DETECT_SOFTLOCKUP=n "fixes" it.

It sounds unlikely, but I'd suggest that you see if it's the same on your
machine so we're not both chasing the same bug.

--

From: Jeff Chua
Date: Wednesday, February 13, 2008 - 1:23 am

I don't have CONFIG_DETECT_SOFTLOCKUP defined in .config and there's
no option in menuconfig to select it.

I don't know whether my problem on a Lenovo X60s is related or not. I
can power off on shutdown and suspend-to-ram, but on suspend-to-disk the
screen turns green and the system doesn't power off. I have to manually
press and hold the power switch to switch it off. The system is able to
resume later. It was working as recently as last week, but something
changed in the past few days.


Jeff.
--

From: Frans Pop
Date: Wednesday, February 13, 2008 - 2:24 am

Unsetting CONFIG_DETECT_SOFTLOCKUP does not help in my case, but thanks for 
the suggestion.
--

From: Frans Pop
Date: Wednesday, February 13, 2008 - 4:39 am

I already noticed yesterday that there's one hunk in that commit that's not
a straight replacement:
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 9e102af..5efd555 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1030,8 +1030,6 @@ static int __cpufreq_remove_dev (struct sys_device * sys_dev)

        unlock_policy_rwsem_write(cpu);

-       kobject_unregister(&data->kobj);
-
        kobject_put(&data->kobj);

        /* we need to make sure that the underlying kobj is actually


So, just on the off chance, I applied the patch below and bingo, the system
powers off again. I doubt this will be the correct solution, but just in
case it is, here's my sign-off. A comment why the double put is needed
would probably be good though.

Signed-off-by: Frans Pop <elendil@planet.nl>

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 64926aa..9dbaac6 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1058,6 +1058,7 @@ static int __cpufreq_remove_dev (struct sys_device * sys_dev)
 	unlock_policy_rwsem_write(cpu);
 
 	kobject_put(&data->kobj);
+	kobject_put(&data->kobj);
 
 	/* we need to make sure that the underlying kobj is actually
 	 * not referenced anymore by anybody before we proceed with
--

From: Greg KH
Date: Wednesday, February 13, 2008 - 9:58 am

There is a bug in the cpufreq kref logic that makes this "double put"
necessary.  A real fix has already been posted to solve this issue, and
I think it should be on its way to Linus for -rc2 already.

Please let me know if -rc2 comes out without this needed fix.

thanks,

greg k-h
--

From: Rafael J. Wysocki
Date: Wednesday, February 13, 2008 - 11:55 am

Can you point me to the fix, please?

Thanks,
Rafael
--

From: Greg KH
Date: Thursday, February 14, 2008 - 11:59 pm

I swear someone else sent this in, but my archives don't show it at all.

I think the patch below should solve this, but I need someone to test
it.

thanks,

greg k-h

---
 drivers/cpufreq/cpufreq.c |    8 --------
 1 file changed, 8 deletions(-)

--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1006,14 +1006,6 @@ static int __cpufreq_remove_dev (struct 
 	}
 #endif
 
-
-	if (!kobject_get(&data->kobj)) {
-		spin_unlock_irqrestore(&cpufreq_driver_lock, flags);
-		cpufreq_debug_enable_ratelimit();
-		unlock_policy_rwsem_write(cpu);
-		return -EFAULT;
-	}
-
 #ifdef CONFIG_SMP
 
 #ifdef CONFIG_HOTPLUG_CPU
--

From: Jeff Chua
Date: Friday, February 15, 2008 - 1:52 am

I tested it, but it doesn't fix the problem for me. Maybe my problem is
different, as my X60s just doesn't power off on suspend-to-disk.

My .config says ...
     # CONFIG_CPU_FREQ is not set
     # CONFIG_CPU_IDLE is not set


On Wed, Feb 13, 2008 at 3:54 PM, Andrew Morton

Also, I've tried CONFIG_DETECT_SOFTLOCKUP=n, but this doesn't fix it either.


Here's the last dmesg output after suspend-to-disk, where it hangs:

CPU 1 is now offline
SMP alternatives: switching to UP code
PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.00 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
PM: Shrinking memory... done (0 pages freed)
PM: Freed 0 kbytes in 0.10 seconds (0.00 MB/s)
ACPI: Preparing to enter system sleep state S4
Suspending console(s)

[ ... it just hangs here ... pressing the power switch does the job, and
the system is able to resume upon powering on ]


Thanks,
Jeff.
--

From: Greg KH
Date: Friday, February 15, 2008 - 2:00 pm

Wait, this is a suspend-to-disk issue.  Totally different than the "will
not power off" issue.

Can you start a new thread on this, and add the suspend people to it?

thanks,

greg k-h
--

From: Frans Pop
Date: Wednesday, February 13, 2008 - 12:28 pm

OK, great.

Do you think that #6879 could be caused by a similar issue elsewhere in the 
tree? Can you give me some pointers on how I could find out (debugging to 

Will do.
--

From: Yinghai Lu
Date: Thursday, February 14, 2008 - 4:38 pm

After disabling cpufreq, I got

ACPI: Preparing to enter system sleep state S5
Disabling non-boot CPUs ...
kvm: disabling virtualization on CPU1
CPU 1 is now offline
CPU1 is down
kvm: disabling virtualization on CPU2
CPU 2 is now offline
================> hang here.

but x86.git/mm can get through taking down all the CPUs...

interesting...

YH
--

From: Ingo Molnar
Date: Thursday, February 14, 2008 - 4:48 pm

i suspect some kobject related race, and i have the feeling this all is 
timing dependent.

Andrew started seeing reboot hangs roughly around the time when the 
kobject changes went upstream. Given that x86.git had flux in that 
timeframe too i couldnt be sure what caused them.

i have the fixlet below in x86.git but it didnt solve Andrew's problem 
so it's parking now at the end of the queue, with no clear purpose in 
life :-) If it would solve someone's problem it might be revitalized.

Note: this does not fix any particular bug i know about, it's just a 
hack.

	Ingo

------------------------------>
Subject: x86: highprio shutdown hack
From: Ingo Molnar <mingo@elte.hu>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/reboot.c |   16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

Index: linux-x86.q/arch/x86/kernel/reboot.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/reboot.c
+++ linux-x86.q/arch/x86/kernel/reboot.c
@@ -396,8 +396,20 @@ static void native_machine_shutdown(void
 	if (!cpu_isset(reboot_cpu_id, cpu_online_map))
 		reboot_cpu_id = smp_processor_id();
 
-	/* Make certain I only run on the appropriate processor */
-	set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+	/*
+	 * Make certain we only run on the appropriate processor,
+	 * and with sufficient priority:
+	 */
+	{
+		struct sched_param schedparm;
+		int ret;
+
+		schedparm.sched_priority = 99;
+		ret = sched_setscheduler(current, SCHED_RR, &schedparm);
+		WARN_ON_ONCE(1);
+
+		set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+	}
 
 	/* O.K Now that I'm on the appropriate processor,
 	 * stop all of the others.
--

From: Yinghai Lu
Date: Thursday, February 14, 2008 - 5:06 pm

so I got

------------[ cut here ]------------
WARNING: at arch/x86/kernel/reboot.c:409 native_machine_shutdown+0x5f/0xb4()
Modules linked in:
Pid: 7173, comm: reboot Not tainted 2.6.25-rc1-smp-00168-g458504f-dirty #33

Call Trace:
 [<ffffffff802521d8>] warn_on_slowpath+0x64/0x8e
 [<ffffffff802464c8>] enqueue_task+0x5c/0x7e
 [<ffffffff8027765c>] rt_mutex_adjust_pi+0x28/0x94
 [<ffffffff8024c075>] sched_setscheduler+0x304/0x33c
 [<ffffffff80237feb>] native_machine_shutdown+0x5f/0xb4
 [<ffffffff80237f6e>] native_machine_restart+0x2e/0x4c
 [<ffffffff8026234b>] sys_reboot+0x140/0x1b2
 [<ffffffff8029bbd2>] handle_mm_fault+0x380/0x705
 [<ffffffff802cc311>] d_kill+0x50/0x7c
 [<ffffffff8097d000>] do_page_fault+0x3bd/0x7c9
 [<ffffffff804d58ae>] __up_read+0x27/0xb5
 [<ffffffff8022432b>] system_call_after_swapgs+0x7b/0x80

---[ end trace eb0e49090acb42b5 ]---
--

From: Yinghai Lu
Date: Thursday, February 14, 2008 - 7:14 pm

It seems to happen only in this sequence:
1. First a hang with cpufreq enabled.
2. Then a reboot into a kernel with cpufreq disabled shows the problem.

I wonder whether the CPU frequencies are out of sync, and the next kernel
after the reboot doesn't have cpufreq, so a warm reset doesn't do the
right job of syncing the frequencies again.

Greg,

Where is the patch to fix the cpufreq problem?

YH
--

From: Yinghai Lu
Date: Thursday, February 14, 2008 - 8:45 pm

ACPI: Preparing to enter system sleep state S5
Disabling non-boot CPUs ...
kvm: disabling virtualization on CPU1
CPU 1 is now offline
1
2
3
4
5
CPU1 is down
kvm: disabling virtualization on CPU2
CPU 2 is now offline
1
2
3
4
5
CPU2 is down
kvm: disabling virtualization on CPU3
CPU 3 is now offline
========> some time later
Clocksource tsc unstable (delta = 515397918052 ns)
Time: hpet clocksource has been installed.


it hangs in

raw_notifier_call_chain(&cpu_chain, CPU_DEAD | mode, hcpu)== NOTIFY_BAD);

There are several notifier blocks; not sure which one causes the hang.

8 hrtimer.c        hrtimers_init            1505 register_cpu_notifier(&hrtimers_nb);
9 rcuclassic.c     __rcu_init                570 register_cpu_notifier(&rcu_nb);
a rcupreempt.c     __rcu_init                892 register_cpu_notifier(&rcu_nb); ===> not used
b sched.c          migration_init           5951 register_cpu_notifier(&migration_notifier);
c softirq.c        spawn_ksoftirqd           645 register_cpu_notifier(&cpu_nfb);
d softlockup.c     spawn_softlockup_task     310 register_cpu_notifier(&cpu_nfb);
e timer.c          init_timers              1367 register_cpu_notifier(&timers_nb);
f page-writeback.c page_writeback_init       775 register_cpu_notifier(&ratelimit_nb);
g page_alloc.c     setup_per_cpu_pageset    2744 register_cpu_notifier(&pageset_notifier);
h slab.c           kmem_cache_init          1638 register_cpu_notifier(&cpucache_notifier);  ==> not used
i slub.c           kmem_cache_init          3036 register_cpu_notifier(&slab_notifier);
j vmstat.c         setup_vmstat              855 register_cpu_notifier(&vmstat_notifier);
k kvm_main.c       kvm_init                 1328 r = register_cpu_notifier(&kvm_cpu_notifier);

maybe the one in softlockup.c?

YH
--

From: Greg KH
Date: Thursday, February 14, 2008 - 11:52 pm

Ugh, sorry, I was mistaken, it's not a cpufreq issue, it's a
CONFIG_DETECT_SOFTLOCKUP issue.  Or that is what I was told before.

But the fact that you fixed the problem with an extra kobject_put()
makes me worry.  There might be a reference issue still there.  I'll
look into it.

thanks,

greg k-h
--

From: Yinghai Lu
Date: Friday, February 15, 2008 - 1:41 am

could be two issues:
one in cpufreq, and one in detect softlockup...

YH
--

From: Greg KH
Date: Friday, February 15, 2008 - 1:58 pm

Looks like it's that way :)
--

From: Greg KH
Date: Thursday, February 14, 2008 - 11:57 pm

I swear someone sent this patch in before.  Can you try this one below,
there seems to be an imbalance with kobject_get and _put.

thanks,

greg k-h

---
 drivers/cpufreq/cpufreq.c |    8 --------
 1 file changed, 8 deletions(-)

--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1006,14 +1006,6 @@ static int __cpufreq_remove_dev (struct 
 	}
 #endif
 
-
-	if (!kobject_get(&data->kobj)) {
-		spin_unlock_irqrestore(&cpufreq_driver_lock, flags);
-		cpufreq_debug_enable_ratelimit();
-		unlock_policy_rwsem_write(cpu);
-		return -EFAULT;
-	}
-
 #ifdef CONFIG_SMP
 
 #ifdef CONFIG_HOTPLUG_CPU
--

From: Frans Pop
Date: Friday, February 15, 2008 - 3:19 am

I did remember seeing this patch before [1] and can confirm that it does 
indeed fix the issue: with this patch applied to 2.6.25 git head my system 
powers off correctly.


--

From: Yinghai Lu
Date: Friday, February 15, 2008 - 12:38 pm

Confirmed. With this patch, I still need to disable CONFIG_DETECT_SOFTLOCKUP.

I assume the watchdog thread for the dead CPU cannot be stopped and hangs somewhere.

YH
--

From: Yinghai Lu
Date: Friday, February 15, 2008 - 1:31 pm

Ingo,

With the patch at http://lkml.org/lkml/2008/2/8/342 and the following patch, it
can power off with CONFIG_DETECT_SOFTLOCKUP enabled:

diff --git a/kernel/softlockup.c b/kernel/softlockup.c
index 7c2da88..c16a658 100644
--- a/kernel/softlockup.c
+++ b/kernel/softlockup.c
@@ -282,12 +282,12 @@ cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
        case CPU_UP_CANCELED_FROZEN:
                if (!per_cpu(watchdog_task, hotcpu))
                        break;
-               /* Unbind so it can run.  Fall thru. */
-               kthread_bind(per_cpu(watchdog_task, hotcpu),
-                            any_online_cpu(cpu_online_map));
+               /* Fall thru. */
        case CPU_DEAD:
        case CPU_DEAD_FROZEN:
                p = per_cpu(watchdog_task, hotcpu);
+               /* Unbind so it can run. */
+               kthread_bind(p, any_online_cpu(cpu_online_map));
                per_cpu(watchdog_task, hotcpu) = NULL;
                kthread_stop(p);
                break;

but I get a WARN on every CPU:

ACPI: Preparing to enter system sleep state S5
Disabling non-boot CPUs ...
kvm: disabling virtualization on CPU1
CPU 1 is now offline
------------[ cut here ]------------
WARNING: at kernel/kthread.c:176 cpu_callback+0x14f/0x177()
Modules linked in:
Pid: 7224, comm: halt Not tainted 2.6.25-rc1-smp-00266-g4ee29f6-dirty #110

Call Trace:
 [<ffffffff80243c61>] warn_on_slowpath+0x51/0x63
 [<ffffffff80257e00>] ktime_get_ts+0x3d/0x48
 [<ffffffff8023b160>] hrtick_start_fair+0xe1/0x129
 [<ffffffff8023a049>] enqueue_task+0x4d/0x58
 [<ffffffff8023ceda>] try_to_wake_up+0x1ae/0x1bf
 [<ffffffff80848796>] cpu_callback+0x14f/0x177
 [<ffffffff802747e4>] writeback_set_ratelimit+0x17/0x5d
 [<ffffffff8084dd78>] notifier_call_chain+0x29/0x4c
 [<ffffffff8026083c>] _cpu_down+0x18e/0x251
 [<ffffffff80260a3d>] disable_nonboot_cpus+0x50/0xd7
 [<ffffffff8024feb4>] kernel_power_off+0x21/0x3a
 [<ffffffff802500de>] sys_reboot+0xee/0x187
 ...
From: Greg KH
Date: Friday, February 15, 2008 - 1:58 pm

Great, thanks for testing and letting us know.

greg k-h
--

From: Greg KH
Date: Friday, February 15, 2008 - 1:58 pm

Ah, thanks, for some reason I couldn't find this in my archives.

I'll add this to my queue to go to Linus.

thanks,

greg k-h
--

From: Frans Pop
Date: Wednesday, February 13, 2008 - 1:41 am

I was wrong :-(

I'd not really done any real workkkk under 2.6.25 yet, but now while running 
a kernel compile with -j4 (single processor, dual core Pentium D), I see 
this behavior. The mouse cursor moves a bit jerky and I sometimes get key 
presses repeated.

While I'm typing this, the load lowers a bit and immediately things become 
smoother and the key repeats seem to vanish.

(The key repeats in the subject and para above are real examples of this, 
not typo's.)

The keyboard repeat issue looks like what was reported in [1], but for me 
this is very definitely a new issue that did not appear with 2.6.24 or 
earlier.

Cheers,
FJP

[1] http://lkml.org/lkml/2008/2/6/100
--

From: Pavel Machek
Date: Friday, February 15, 2008 - 4:58 pm

From: Gabriel C
Date: Friday, February 15, 2008 - 6:23 pm

I have that problem on my Dell Precision WorkStation: as soon as I stress the box
a bit, the keyboard goes mad.

Gabriel
--

From: Mike Galbraith
Date: Friday, February 15, 2008 - 11:19 pm

Sounds like you may have CONFIG_GROUP_SCHED set?  Bisection fingered
6b2d7700266b9402e12824e11e0099ae6a4a6a79 as the source here.

	-Mike

--

From: Gabriel C
Date: Saturday, February 16, 2008 - 3:09 am

I can't confirm it is that commit because I cannot revert it cleanly, but
I can confirm there is something wrong with CONFIG_GROUP_SCHED.

( maybe Peter or Ingo knows =) )

Turning CONFIG_GROUP_SCHED off on this box fixes the mouse and keyboard problems. 

Gabriel
--

From: Mike Galbraith
Date: Saturday, February 16, 2008 - 7:03 am

Yeah, looks like it is the same issue.  I'd suggest that folks who are
hitting this disable CONFIG_GROUP_SCHED, and flog other parts of Funky
Weasel's anatomy while it's being sorted.

	-Mike

--
