Re: 2.6.27-rc1: critical thermal shutdown on thinkpad x60

Previous thread: 2.6.26.1 hangs on dl385 by Peter Palfrader on Wednesday, August 6, 2008 - 5:33 am. (2 messages)

Next thread: [PATCH 0/5] Support for Arcom/Eurotech Viper SBC by Marc Zyngier on Wednesday, August 6, 2008 - 6:19 am. (10 messages)
From: Pavel Machek
Date: Wednesday, August 6, 2008 - 2:02 am

Hi!

Aug  6 11:00:10 amd kernel: ACPI: Critical trip point
Aug  6 11:00:10 amd kernel: Critical temperature reached (128 C),
shutting down.
Aug  6 11:00:10 amd shutdown[24414]: shutting down for system halt

...and machine went down at that point :-(.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Karel Zak
Date: Thursday, August 7, 2008 - 4:34 am

It seems like a bad day for ThinkPads:

Aug  6 20:05:36 nb kernel: ACPI: Critical trip point
Aug  6 20:05:36 nb kernel: Critical temperature reached (128 C),
shutting down.

 Kernel 2.6.26-136.fc10.x86_64 on x61.

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
--

From: Fabio Comolli
Date: Thursday, August 7, 2008 - 6:05 am

This does not seem to be limited to ThinkPads: my HP laptop started to shut
down on thermal events with 27-rc1 too. It happened twice (once with
-rc1 and once with -rc2). It never happened before.


--

From: Yves-Alexis Perez
Date: Thursday, August 7, 2008 - 7:41 am

Global warming.
-- 
Yves-Alexis
--

From: Andi Kleen
Date: Thursday, August 7, 2008 - 9:01 am

I hope you can easily reproduce it?

So it's new in 2.6.27rc1 and wasn't in 2.6.26? Can you please
double check that? Are there any new warnings in the boot logs
from ACPI compared to .26?

I looked through the pile of patches that went in for ACPI and the 
only candidate that might have imho caused this would be 
ea51011a27db48ea0a80a5e20de3969b292d5d4d. Can you please 
try reverting that. If that doesn't help a full bisect will be needed.

-Andi
--

From: Pavel Machek
Date: Tuesday, August 12, 2008 - 2:41 am

Not that one :-(. Thinkpad does not even have fan device: it is
controlled by hardware.
									Pavel

--- /tmp/dmesg.26	2008-08-12 11:38:44.000000000 +0200
+++ /tmp/dmesg.rc2	2008-08-12 11:15:44.000000000 +0200
@@ -1,4 +1,4 @@
-Linux version 2.6.26 (pavel@amd) (gcc version 4.1.3 20071209 (prerelease) (Debian 4.1.2-18)) #313 SMP Mon Jul 14 08:33:14 CEST 2008
+Linux version 2.6.27-rc2 (pavel@amd) (gcc version 4.1.3 20071209 (prerelease) (Debian 4.1.2-18)) #322 SMP Thu Aug 7 11:58:09 CEST 2008
 PAT disabled. Not yet verified on this CPU type.
 BIOS-provided physical RAM map:
  BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
@@ -16,31 +16,13 @@
  BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved)
  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
  BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved)
-1142MB HIGHMEM available.
-896MB LOWMEM available.
-found SMP MP-table at [c00f67f0] 000f67f0
-Entering add_active_range(0, 0, 521936) 0 entries of 256 used
-Zone PFN ranges:
-  DMA             0 ->     4096
-  Normal       4096 ->   229376
-  HighMem    229376 ->   521936
-Movable zone start PFN for each node
-early_node_map[1] active PFN ranges
-    0:        0 ->   521936
-On node 0 totalpages: 521936
-  DMA zone: 32 pages used for memmap
-  DMA zone: 0 pages reserved
-  DMA zone: 4064 pages, LIFO batch:0
-  Normal zone: 1760 pages used for memmap
-  Normal zone: 223520 pages, LIFO batch:31
-  HighMem zone: 2286 pages used for memmap
-  HighMem zone: 290274 pages, LIFO batch:31
-  Movable zone: 0 pages used for memmap
+last_pfn = 0x7f6d0 max_arch_pfn = 0x100000
+kernel direct mapping tables up to 38000000 @ 7000-c000
 DMI present.
 ACPI: RSDP 000F67C0, 0024 (r2 LENOVO)
 ACPI: XSDT 7F6D191C, 0084 (r1 LENOVO TP-7B        2140  LTP        0)
 ACPI: FACP 7F6D1A00, 00F4 (r3 LENOVO TP-7B        2140 LNVO        1)
-ACPI Warning (tbfadt-0442): Optional field "Gpe1Block" has zero address or length: 000000000000102C/0 [20080321]
+ACPI Warning ...
--

From: Andi Kleen
Date: Tuesday, August 12, 2008 - 3:54 am

Does this mean you can easily reproduce it?  

Ok it was just a long shot anyways.

-Andi
--

From: Milan Broz
Date: Tuesday, August 12, 2008 - 4:07 am

Hi,
I see exactly the same on my x60s, but during upgrade to 2.6.26.2.

I found that (at least in my case) the problem is that in
2.6.25 the core frequency dropped to 1GHz (instead of 1.67GHz) when
the temperature was above some limit.

Now the CPU cores remain at 1.67GHz and the fan is unable to cool them properly
under heavy load (even if I set "level disengaged" through thinkpad fan control).
After a while the temperature sensor shows 128 C (probably not the real temperature,
I expect it is some critical flag), and the system properly switches itself off.

(I had a bad reproducer script in the bisect and the bisect failed, so I'll try it again,
but anyway, for me the bug is even in the 2.6.26 tree. It never happened in 2.6.25.)

Milan
--

From: Pavel Machek
Date: Tuesday, August 12, 2008 - 4:26 am

How do you control fans? I could not get anything but -EINVAL from IBM

Hmmm... that's seriously strange. I definitely don't see it in
2.6.26. Maybe it is config dependent?! (Attaching my 2.6.27-rc2
failing config.)
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Milan Broz
Date: Tuesday, August 12, 2008 - 4:44 am

yes. maybe some userspace tool controlling frequency is involved, no idea yet.

No, it is not ok.

you need to add fan_control=1 to the thinkpad_acpi module

http://www.thinkwiki.org/wiki/How_to_control_fan_speed

hm. strange, I'll try this config too...

Milan
--

From: Pavel Machek
Date: Tuesday, August 12, 2008 - 4:55 am

So it definitely is in 2.6.26.2, and it definitely is in 2.6.26?


Thanks for pointers!

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Rafael J. Wysocki
Date: Tuesday, August 12, 2008 - 7:34 am

Pavel, can you check if the state of the fan(s) changes while the thermal trip
points are being passed?

As I said in http://bugzilla.kernel.org/show_bug.cgi?id=11281, I suspect that
this mechanism may be broken.

Thanks,
Rafael
--

From: Milan Broz
Date: Tuesday, August 12, 2008 - 7:57 am

The bug is _not_ in 2.6.26, it was introduced in 2.6.26.1.

The problem is that the CPU frequency no longer decreases at some
temperature level, and the fan is unable to cool it properly.

bisect on 2.6.26.y tree finished in this patch:
(I expect similar patch in 2.6.27-rc)

commit 04f496871e8af87a1e40c504371a206fd7389193
Author: Thomas Renninger <trenn@suse.de>
Date:   Wed Jul 30 18:20:10 2008 +0000

    cpufreq acpi: only call _PPC after cpufreq ACPI init funcs got called already

    commit a1531acd43310a7e4571d52e8846640667f4c74b upstream

    Ingo Molnar provided a fix to not call _PPC at processor driver
    initialization time in "[PATCH] ACPI: fix cpufreq regression" (git
    commit e4233dec749a3519069d9390561b5636a75c7579)

    But it can still happen that _PPC is called at processor driver
    initialization time.

    This patch should make sure that this is not possible anymore.



That seems strange to me... could anyone please verify it
on some other x60?

Milan
--

From: Milan Broz
Date: Tuesday, August 12, 2008 - 8:48 am

and this seems to fix it for me:
--

Do not use unsigned int if there is a test for a negative number...

See drivers/acpi/processor_perflib.c
  static unsigned int ignore_ppc = -1;
...
  if (event == CPUFREQ_START && ignore_ppc <= 0) {
       ignore_ppc = 0;
...

Signed-off-by: Milan Broz <mbroz@redhat.com>
---
 drivers/acpi/processor_perflib.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.26.y/drivers/acpi/processor_perflib.c
===================================================================
--- linux-2.6.26.y.orig/drivers/acpi/processor_perflib.c	2008-08-12 17:20:07.000000000 +0200
+++ linux-2.6.26.y/drivers/acpi/processor_perflib.c	2008-08-12 17:35:53.000000000 +0200
@@ -70,7 +70,7 @@ static DEFINE_MUTEX(performance_mutex);
  *  0 -> cpufreq low level drivers initialized -> consider _PPC values
  *  1 -> ignore _PPC totally -> forced by user through boot param
  */
-static unsigned int ignore_ppc = -1;
+static int ignore_ppc = -1;
 module_param(ignore_ppc, uint, 0644);
 MODULE_PARM_DESC(ignore_ppc, "If the frequency of your machine gets wrongly" \
 		 "limited by BIOS, this should help");


--

From: Thomas Renninger
Date: Tuesday, August 12, 2008 - 9:01 am

Hmm, the machine should still not shut down. We need the virtual
Ohh dear..., what kind of obvious bug have I introduced.

Thanks a lot!

         Thomas
--

From: Pavel Machek
Date: Wednesday, August 13, 2008 - 12:08 am

Won't help here.

We already do have real passive trip point on the other thermal zone,
and the zone that actually forces shutdown goes 95->128C instantly
(see that DSDT). Virtual passive trip point at 115C will not help
anything.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
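
Pavel's point about the instant 95->128C jump can be illustrated with a small C sketch. It is only a model under stated assumptions: the helper names (classify_sample, passive_samples_before_critical) are hypothetical and not kernel code, the 115 C passive trip point is the virtual one mentioned above, and 127 C is the critical value from the trip point listing in this thread. The sample sequences are made up for illustration.

```c
#include <assert.h>

/* Hypothetical model of what a thermal zone does with one sample. */
enum trip_action { TRIP_NONE, TRIP_PASSIVE, TRIP_CRITICAL };

/* Classify one temperature sample (degrees C); critical takes priority. */
enum trip_action classify_sample(int temp_c, int passive_c, int critical_c)
{
	if (temp_c >= critical_c)
		return TRIP_CRITICAL;
	if (temp_c >= passive_c)
		return TRIP_PASSIVE;
	return TRIP_NONE;
}

/*
 * Count the samples in which passive cooling (throttling) would get a
 * chance to act before the first critical sample ends the run.
 */
int passive_samples_before_critical(const int *samples, int n,
				    int passive_c, int critical_c)
{
	int passive = 0;

	for (int i = 0; i < n; i++) {
		enum trip_action a = classify_sample(samples[i], passive_c,
						     critical_c);
		if (a == TRIP_CRITICAL)
			break;
		if (a == TRIP_PASSIVE)
			passive++;
	}
	return passive;
}

/* Usage example with illustrative (made-up) sample sequences. */
void trip_self_check(void)
{
	const int jump[] = { 90, 93, 95, 128 };       /* 95 -> 128 at once */
	const int ramp[] = { 95, 110, 118, 124, 128 }; /* gradual rise */

	assert(passive_samples_before_critical(jump, 4, 115, 127) == 0);
	assert(passive_samples_before_critical(ramp, 5, 115, 127) == 2);
}
```

With a reading that jumps straight past both thresholds, no sample ever lands between the passive and critical trip points, so a 115 C passive trip point never gets a chance to throttle anything.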

From: Andi Kleen
Date: Tuesday, August 12, 2008 - 9:28 am

> and this seems to fix it for me:

Great. Thanks for the patch. I wonder why gcc didn't warn about this.

-Andi

--
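
For readers following along, here is a minimal user-space sketch of the signedness bug in Milan's patch. The variable and function names are hypothetical stand-ins, not the kernel's, and the event constants are invented. It assumes, per the comment quoted in the patch, that 0 means "consider _PPC values" and any other value means _PPC is not applied. As for the missing warning: as far as I know, gcc's type-limit diagnostics target comparisons that are always true or always false (such as unsigned < 0), whereas "<= 0" on an unsigned value is still true for 0, so it would probably not be flagged.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the cpufreq notifier event codes. */
enum { EVENT_START = 1, EVENT_OTHER = 2 };

/* Buggy variant: -1 converts to UINT_MAX, so "<= 0" holds only for 0. */
static unsigned int ignore_ppc_unsigned = -1;

/* Fixed variant: -1 stays negative, so the first START event matches. */
static int ignore_ppc_signed = -1;

/* Returns true if _PPC limits would be considered after this event. */
bool ppc_active_buggy(int event)
{
	if (event == EVENT_START && ignore_ppc_unsigned <= 0)
		ignore_ppc_unsigned = 0;	/* never reached from -1 */
	return ignore_ppc_unsigned == 0;
}

bool ppc_active_fixed(int event)
{
	if (event == EVENT_START && ignore_ppc_signed <= 0)
		ignore_ppc_signed = 0;		/* -1 -> 0 on first START */
	return ignore_ppc_signed == 0;
}

/* Usage example: the same START event has opposite effects. */
void ppc_self_check(void)
{
	assert(!ppc_active_buggy(EVENT_START));
	assert(ppc_active_fixed(EVENT_START));
}
```

In the unsigned variant the state never leaves its "not initialized" value, which matches the observed symptom that the frequency was no longer limited when the machine got hot.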

From: Dominik Brodowski
Date: Tuesday, August 12, 2008 - 11:30 am

Hi,

			    ^^^^
follow-up change?

Best,
	Dominik
--

From: Andi Kleen
Date: Tuesday, August 12, 2008 - 11:59 am

I'll fix it in the patch, thanks.

-Andi
--

From: Rafael J. Wysocki
Date: Tuesday, August 12, 2008 - 12:56 pm

Is the complete patch available anywhere?  I need a link to it for the list of
regressions.

Thanks,
Rafael
--

From: Milan Broz
Date: Wednesday, August 13, 2008 - 3:39 am

yep, thanks.
I am running my x60s with this patch now:

--

Fix signed parameter in ACPI frequency notifier.

  static unsigned int ignore_ppc = -1;
...
       if (event == CPUFREQ_START && ignore_ppc <= 0) {
               ignore_ppc = 0;
...

Signed-off-by: Milan Broz <mbroz@redhat.com>
---
 drivers/acpi/processor_perflib.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.26.y/drivers/acpi/processor_perflib.c
===================================================================
--- linux-2.6.26.y.orig/drivers/acpi/processor_perflib.c	2008-08-12 17:20:07.000000000 +0200
+++ linux-2.6.26.y/drivers/acpi/processor_perflib.c	2008-08-13 09:32:38.000000000 +0200
@@ -70,8 +70,8 @@ static DEFINE_MUTEX(performance_mutex);
  *  0 -> cpufreq low level drivers initialized -> consider _PPC values
  *  1 -> ignore _PPC totally -> forced by user through boot param
  */
-static unsigned int ignore_ppc = -1;
-module_param(ignore_ppc, uint, 0644);
+static int ignore_ppc = -1;
+module_param(ignore_ppc, int, 0644);
 MODULE_PARM_DESC(ignore_ppc, "If the frequency of your machine gets wrongly" \
 		 "limited by BIOS, this should help");
 


--

From: Milan Broz
Date: Thursday, August 14, 2008 - 6:56 am

(adding cc: stable, the bug is also in 2.6.26.2)

yep, thanks.
I am running my x60s with this patch now:

--

Fix signed parameter in ACPI frequency notifier.

  static unsigned int ignore_ppc = -1;
...
       if (event == CPUFREQ_START && ignore_ppc <= 0) {
               ignore_ppc = 0;
...

Signed-off-by: Milan Broz <mbroz@redhat.com>
---
 drivers/acpi/processor_perflib.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.26.y/drivers/acpi/processor_perflib.c
===================================================================
--- linux-2.6.26.y.orig/drivers/acpi/processor_perflib.c	2008-08-12 17:20:07.000000000 +0200
+++ linux-2.6.26.y/drivers/acpi/processor_perflib.c	2008-08-13 09:32:38.000000000 +0200
@@ -70,8 +70,8 @@ static DEFINE_MUTEX(performance_mutex);
  *  0 -> cpufreq low level drivers initialized -> consider _PPC values
  *  1 -> ignore _PPC totally -> forced by user through boot param
  */
-static unsigned int ignore_ppc = -1;
-module_param(ignore_ppc, uint, 0644);
+static int ignore_ppc = -1;
+module_param(ignore_ppc, int, 0644);
 MODULE_PARM_DESC(ignore_ppc, "If the frequency of your machine gets wrongly" \
 		 "limited by BIOS, this should help");
 



--

From: Pavel Machek
Date: Wednesday, August 13, 2008 - 12:39 am

Tested-by: Pavel Machek <pavel@suse.cz>

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Pavel Machek
Date: Wednesday, August 13, 2008 - 12:39 am

Verified. Your patch from the next email fixes the problem here.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Matthew Garrett
Date: Tuesday, August 12, 2008 - 8:32 am

Thinkpads don't expose fans as ACPI devices, so there are no active trip 
points.
-- 
Matthew Garrett | mjg59@srcf.ucam.org
--

From: Rafael J. Wysocki
Date: Tuesday, August 12, 2008 - 12:57 pm

I didn't know that, sorry.
--

From: Henrique de Moraes Holschuh
Date: Wednesday, August 13, 2008 - 1:13 pm

thinkpad-acpi will regard 128 and -128 as invalid sensors, because that's
how they are used in some BIOSes (and ECs).  We used to bother only with
-128, but Lenovo did something weird in one of the EC firmwares and I had to
add +128 too.  That masks the "help, I am melting" reading.

I have noted that in my TODO, let's see if I can make that into a quirk so
that you won't get -EINVAL anymore.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
--
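
The sentinel handling Henrique describes might look roughly like the following C sketch. It is not the thinkpad-acpi code: the names (SENSOR_NA_NEG, SENSOR_NA_POS, thermal_reading_valid, read_thermal_sensor) are hypothetical, and returning -EINVAL for a rejected reading is an assumption modeled on the error mentioned in his message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names for the two "sensor not available" markers. */
#define SENSOR_NA_NEG (-128)
#define SENSOR_NA_POS 128

/* A raw reading (degrees C) is usable only if it is not a marker value. */
bool thermal_reading_valid(int raw_c)
{
	return raw_c != SENSOR_NA_NEG && raw_c != SENSOR_NA_POS;
}

/*
 * Store a usable reading in *out_c and return 0, or return -EINVAL
 * (assumed error code) when the raw value is one of the markers.
 */
int read_thermal_sensor(int raw_c, int *out_c)
{
	if (!thermal_reading_valid(raw_c))
		return -EINVAL;
	*out_c = raw_c;
	return 0;
}

/* Usage example: a genuine 128 C reading is rejected like a dead sensor. */
void sensor_self_check(void)
{
	int temp = 0;

	assert(read_thermal_sensor(95, &temp) == 0 && temp == 95);
	assert(read_thermal_sensor(128, &temp) == -EINVAL);
	assert(read_thermal_sensor(-128, &temp) == -EINVAL);
}
```

The trade-off is visible in the last two checks: once +128 is treated as "no sensor", a real critical reading of 128 C becomes indistinguishable from an absent sensor.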

From: Pavel Machek
Date: Wednesday, August 13, 2008 - 1:28 pm

It was simpler than that. I did not pass "fan_control=1" option.

(Actually... I do not think that option is needed. If fan control is
known to work, it should be just enabled...)
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Henrique de Moraes Holschuh
Date: Wednesday, August 13, 2008 - 1:42 pm

I require explicit user permission to activate knobs that are that
dangerous, and actively frowned upon by the manufacturer.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
--

From: Pavel Machek
Date: Wednesday, August 13, 2008 - 1:55 pm

Well, it is not more dangerous than 2.6.26.2 (will overheat your
thinkpad, causing critical shutdown). Plus, acpi fans are
controllable/overridable by default from /proc, and 'echo "level 7" >
fan' is not something you can do accidentally...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Pavel Machek
Date: Tuesday, August 12, 2008 - 4:02 am

It is easily reproduced, but it takes 10+ minutes, and at the end
machine is so hot it will not even power up. So yes, bisect is
possible, but I'd prefer to avoid it.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Zhang Rui
Date: Tuesday, August 12, 2008 - 5:55 pm

that's weird.
ACPICA should be 20080609 in 2.6.26.
Pavel, can you please make a double check? :)

thanks,
rui

--

From: Pavel Machek
Date: Thursday, August 7, 2008 - 2:13 am

This one is quite repeatable:

Aug  7 10:46:24 amd dhclient: DHCPACK from 10.20.0.2
Aug  7 10:46:24 amd dhclient: bound to 10.20.5.28 -- renewal in 7200
seconds.
Aug  7 10:50:46 amd kernel: thinkpad_acpi: unhandled HKEY event 0x6022
Aug  7 10:51:03 amd last message repeated 48 times
Aug  7 10:51:05 amd kernel: ACPI: Critical trip point
Aug  7 10:51:05 amd kernel: Critical temperature reached (128 C),
shutting down.
Aug  7 10:51:05 amd shutdown[1928]: shutting down for system halt
Aug  7 10:51:06 amd init: Switching to runlevel: 0
Aug  7 10:51:06 amd kernel: thinkpad_acpi: unhandled HKEY event 0x6022
Aug  7 10:51:09 amd last message repeated 7 times
Aug  7 10:51:09 amd kernel: ACPI: Critical trip point
Aug  7 10:51:09 amd kernel: Critical temperature reached (128 C),
shutting down.
Aug  7 10:51:12 amd exiting on signal 15
Aug  7 10:54:01 amd syslogd 1.5.0#1: restart.
Aug  7 10:54:01 amd kernel: klogd 1.5.0#1, log source = /proc/kmsg
started.

...and it does not seem to be stray reading from the sensor: cat
/proc/acpi/therm*/*/* shows the bogus value in like 5 consecutive
readings.

Plus the temperature rises up to 95C before this triggers, and machine
is so hot it refuses to start again. Trip points seem to assume 128C,
too:

root@amd:~# cat /proc/acpi/thermal_zone/THM*/trip*
critical (S5):           127 C
critical (S5):           97 C
passive:                 93 C: tc1=5 tc2=4 tsp=600 devices=CPU0 CPU1
root@amd:~#

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Pavel Machek
Date: Thursday, August 7, 2008 - 3:38 am

while true; do echo -n; done &
while true; do echo -n; done &

is enough to trigger this. According to /proc/acpi/ibm, fan is running
too slowly...?
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
