Hi!
Aug 6 11:00:10 amd kernel: ACPI: Critical trip point
Aug 6 11:00:10 amd kernel: Critical temperature reached (128 C), shutting down.
Aug 6 11:00:10 amd shutdown[24414]: shutting down for system halt
...and machine went down at that point :-(.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
It seems like a bad day for ThinkPads:
Aug 6 20:05:36 nb kernel: ACPI: Critical trip point
Aug 6 20:05:36 nb kernel: Critical temperature reached (128 C), shutting down.
Kernel 2.6.26-136.fc10.x86_64 on x61.
Karel
--
Karel Zak <kzak@redhat.com>
--
Seems not to be limited to ThinkPads, as my HP laptop started to shut down for thermal events with 27-rc1 too. It happened twice (once with -rc1 and once with -rc2). Never happened before. --
I hope you can easily reproduce it? So it's new in 2.6.27rc1 and wasn't in 2.6.26? Can you please double check that? Are there any new warnings in the boot logs from ACPI compared to .26? I looked through the pile of patches that went in for ACPI and the only candidate that might imho have caused this would be ea51011a27db48ea0a80a5e20de3969b292d5d4d. Can you please try reverting that? If that doesn't help a full bisect will be needed. -Andi --
Not that one :-(. Thinkpad does not even have fan device: it is controlled by hardware.
Pavel
--- /tmp/dmesg.26 2008-08-12 11:38:44.000000000 +0200
+++ /tmp/dmesg.rc2 2008-08-12 11:15:44.000000000 +0200
@@ -1,4 +1,4 @@
-Linux version 2.6.26 (pavel@amd) (gcc version 4.1.3 20071209 (prerelease) (Debian 4.1.2-18)) #313 SMP Mon Jul 14 08:33:14 CEST 2008
+Linux version 2.6.27-rc2 (pavel@amd) (gcc version 4.1.3 20071209 (prerelease) (Debian 4.1.2-18)) #322 SMP Thu Aug 7 11:58:09 CEST 2008
 PAT disabled. Not yet verified on this CPU type.
 BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
@@ -16,31 +16,13 @@
 BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved)
-1142MB HIGHMEM available.
-896MB LOWMEM available.
-found SMP MP-table at [c00f67f0] 000f67f0
-Entering add_active_range(0, 0, 521936) 0 entries of 256 used
-Zone PFN ranges:
-  DMA 0 -> 4096
-  Normal 4096 -> 229376
-  HighMem 229376 -> 521936
-Movable zone start PFN for each node
-early_node_map[1] active PFN ranges
-  0: 0 -> 521936
-On node 0 totalpages: 521936
-  DMA zone: 32 pages used for memmap
-  DMA zone: 0 pages reserved
-  DMA zone: 4064 pages, LIFO batch:0
-  Normal zone: 1760 pages used for memmap
-  Normal zone: 223520 pages, LIFO batch:31
-  HighMem zone: 2286 pages used for memmap
-  HighMem zone: 290274 pages, LIFO batch:31
-  Movable zone: 0 pages used for memmap
+last_pfn = 0x7f6d0 max_arch_pfn = 0x100000
+kernel direct mapping tables up to 38000000 @ 7000-c000
 DMI present.
 ACPI: RSDP 000F67C0, 0024 (r2 LENOVO)
 ACPI: XSDT 7F6D191C, 0084 (r1 LENOVO TP-7B 2140 LTP 0)
 ACPI: FACP 7F6D1A00, 00F4 (r3 LENOVO TP-7B 2140 LNVO 1)
-ACPI Warning (tbfadt-0442): Optional field "Gpe1Block" has zero address or length: 000000000000102C/0 [20080321]
+ACPI Warning ...
Does this mean you can easily reproduce it? Ok it was just a long shot anyways. -Andi --
Hi, I see exactly the same on my x60s, but since upgrading to 2.6.26.2. I found that (at least in my case) the problem is that in 2.6.25 the core frequency drops to 1GHz (instead of 1.67GHz) when the temperature is above some limit. Now the CPU cores remain at 1.67GHz and the fan is unable to cool them properly under heavy load (even if I set "level disengaged" through thinkpad fan control). After a while the temperature sensor shows 128 C (probably not the real temperature; I expect it is some critical flag), and the system is then properly switched off. (I had a bad reproducer script in the bisect and the bisect failed, so I'll try it again, but anyway, for me the bug is even in the 2.6.26 tree. It never happened in 2.6.25.) Milan --
How do you control fans? I could not get anything but -EINVAL from IBM Hmmm... that's seriously strange. I definitely don't see it in 2.6.26. Maybe it is config dependent?! (Attaching my 2.6.27-rc2 failing config.) Pavel --
yes. maybe some userspace tool controlling frequency is involved, no idea yet. No, it is not ok. You need to add fan_control=1 to the thinkpad_acpi module options, see http://www.thinkwiki.org/wiki/How_to_control_fan_speed hm. strange, I'll try this config too... Milan --
So it definitely is in 2.6.26.2, and it definitely is in 2.6.26? Thanks for pointers! --
Pavel, can you check if the state of the fan(s) changes while the thermal trip points are being passed? As I said in http://bugzilla.kernel.org/show_bug.cgi?id=11281, I suspect that this mechanism may be broken. Thanks, Rafael --
The bug is _not_ in 2.6.26, it was introduced in 2.6.26.1.
The problem is, that now the CPU frequency doesn't decrease at some
temperature level and fan is unable to cool it properly.
bisect on 2.6.26.y tree finished in this patch:
(I expect similar patch in 2.6.27-rc)
commit 04f496871e8af87a1e40c504371a206fd7389193
Author: Thomas Renninger <trenn@suse.de>
Date: Wed Jul 30 18:20:10 2008 +0000
cpufreq acpi: only call _PPC after cpufreq ACPI init funcs got called already
commit a1531acd43310a7e4571d52e8846640667f4c74b upstream
Ingo Molnar provided a fix to not call _PPC at processor driver
initialization time in "[PATCH] ACPI: fix cpufreq regression" (git
commit e4233dec749a3519069d9390561b5636a75c7579)
But it can still happen that _PPC is called at processor driver
initialization time.
This patch should make sure that this is not possible anymore.
That seems strange to me... could anyone please verify it
on some other x60?
Milan
--
and this seems to fix it for me:
--
Do not use unsigned int if there is a test for a negative number...
See drivers/acpi/processor_perflib.c
static unsigned int ignore_ppc = -1;
...
if (event == CPUFREQ_START && ignore_ppc <= 0) {
ignore_ppc = 0;
...
Signed-off-by: Milan Broz <mbroz@redhat.com>
---
drivers/acpi/processor_perflib.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6.26.y/drivers/acpi/processor_perflib.c
===================================================================
--- linux-2.6.26.y.orig/drivers/acpi/processor_perflib.c 2008-08-12 17:20:07.000000000 +0200
+++ linux-2.6.26.y/drivers/acpi/processor_perflib.c 2008-08-12 17:35:53.000000000 +0200
@@ -70,7 +70,7 @@ static DEFINE_MUTEX(performance_mutex);
* 0 -> cpufreq low level drivers initialized -> consider _PPC values
* 1 -> ignore _PPC totally -> forced by user through boot param
*/
-static unsigned int ignore_ppc = -1;
+static int ignore_ppc = -1;
module_param(ignore_ppc, uint, 0644);
MODULE_PARM_DESC(ignore_ppc, "If the frequency of your machine gets wrongly" \
"limited by BIOS, this should help");
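
To make the failure mode concrete, here is a minimal user-space sketch of the gate quoted in the commit message. It is not the kernel code: the event names, the two-entry frequency table (taken from the 1.67GHz/1GHz figures reported earlier in the thread, rounded) and all function names are assumptions made for illustration. It only models the fact that -1 stored in an unsigned int becomes UINT_MAX, so `ignore_ppc <= 0` can never be true and the start event never switches _PPC handling on.

```c
/* Hypothetical stand-ins for the cpufreq notifier events. */
enum ppc_event { EV_START, EV_POLICY_CHECK };

/* -1 means "cpufreq low level driver not initialized yet".
 * In an unsigned variable it wraps to UINT_MAX. */
static unsigned int ignore_ppc_buggy = -1;
static int ignore_ppc_fixed = -1;

/* Buggy gate: returns 1 if the _PPC limit would be applied for this event. */
int ppc_applies_buggy(enum ppc_event ev)
{
    if (ev == EV_START && ignore_ppc_buggy <= 0) {
        ignore_ppc_buggy = 0;   /* not reached while the value is UINT_MAX */
        return 0;
    }
    if (ignore_ppc_buggy)
        return 0;               /* limit stays ignored */
    return 1;
}

/* Fixed gate: identical logic, signed state. */
int ppc_applies_fixed(enum ppc_event ev)
{
    if (ev == EV_START && ignore_ppc_fixed <= 0) {
        ignore_ppc_fixed = 0;   /* -1 <= 0, so the start event arms the limit */
        return 0;
    }
    if (ignore_ppc_fixed)
        return 0;
    return 1;
}

/* Illustrative P-state table in MHz; index 0 is the fastest state. */
static const unsigned int pstate_mhz[] = { 1670, 1000 };

/* Highest allowed frequency for a given _PPC index, if the limit applies. */
unsigned int max_allowed_mhz(int limit_applies, unsigned int ppc)
{
    if (!limit_applies || ppc >= sizeof(pstate_mhz) / sizeof(pstate_mhz[0]))
        return pstate_mhz[0];
    return pstate_mhz[ppc];
}
```

Calling each gate first with EV_START and then with EV_POLICY_CHECK shows the difference: the buggy variant keeps ignoring the limit, so max_allowed_mhz() stays at the top state, while the signed variant applies it and a _PPC index of 1 caps the model at 1000 MHz.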
--
Hmm, the machine should still not shut down. We need the virtual
Ohh dear..., what kind of obvious bug have I introduced.
Thanks a lot!
Thomas
--
Won't help here. We already do have real passive trip point on the other thermal zone, and the zone that actually forces shutdown goes 95->128C instantly (see that DSDT). Virtual passive trip point at 115C will not help anything. Pavel --
> and this seems to fix it for me: Great. Thanks for the patch. I wonder why gcc didn't warn about this. -Andi --
Hi, ^^^^ follow-up change? Best, Dominik --
I'll fix it in the patch, thanks. -Andi --
Is the complete patch available anywhere? I need a link to it for the list of regressions. Thanks, Rafael --
yep, thanks.
I am running my x60s with this patch now:
--
Fix signed parameter in ACPI frequency notifier.
static unsigned int ignore_ppc = -1;
...
if (event == CPUFREQ_START && ignore_ppc <= 0) {
ignore_ppc = 0;
...
Signed-off-by: Milan Broz <mbroz@redhat.com>
---
drivers/acpi/processor_perflib.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux-2.6.26.y/drivers/acpi/processor_perflib.c
===================================================================
--- linux-2.6.26.y.orig/drivers/acpi/processor_perflib.c 2008-08-12 17:20:07.000000000 +0200
+++ linux-2.6.26.y/drivers/acpi/processor_perflib.c 2008-08-13 09:32:38.000000000 +0200
@@ -70,8 +70,8 @@ static DEFINE_MUTEX(performance_mutex);
* 0 -> cpufreq low level drivers initialized -> consider _PPC values
* 1 -> ignore _PPC totally -> forced by user through boot param
*/
-static unsigned int ignore_ppc = -1;
-module_param(ignore_ppc, uint, 0644);
+static int ignore_ppc = -1;
+module_param(ignore_ppc, int, 0644);
MODULE_PARM_DESC(ignore_ppc, "If the frequency of your machine gets wrongly" \
"limited by BIOS, this should help");
--
(adding cc: stable, the bug is also in 2.6.26.2)
--
Tested-by: Pavel Machek <pavel@suse.cz> --
Verified. Your patch from the next email fixes the problem here. --
Thinkpads don't expose fans as ACPI devices, so there are no active trip points. -- Matthew Garrett | mjg59@srcf.ucam.org --
thinkpad-acpi will regard 128 and -128 as invalid sensors, because that's how they are used in some BIOSes (and ECs). We used to bother only with -128, but Lenovo did something weird in one of the EC firmwares and I had to add +128 too. That masks the "help, I am melting" reading. I have noted that in my TODO, let's see if I can make that into a quirk so that you won't get -EINVAL anymore. -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh --
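
The sentinel handling described in this message can be sketched roughly as follows. This is not the thinkpad-acpi source; the macro and function names are hypothetical, and only the two sentinel values and the -EINVAL result come from the thread. It shows why a genuine-looking 128 C reading gets reported as an invalid sensor rather than as a temperature.

```c
#include <errno.h>

/* Sentinel values per the message above: -128, and on some EC firmware
 * +128, mean "sensor not available". Names are hypothetical. */
#define SENSOR_NA_NEG (-128)
#define SENSOR_NA_POS 128

/* Stores the reading and returns 0, or returns -EINVAL for a sentinel.
 * A real 128 C reading is indistinguishable from the +128 sentinel. */
int sensor_read_checked(int raw_celsius, int *out_celsius)
{
    if (raw_celsius == SENSOR_NA_NEG || raw_celsius == SENSOR_NA_POS)
        return -EINVAL;
    *out_celsius = raw_celsius;
    return 0;
}
```

A quirk of the kind mentioned above would presumably narrow the +128 check to the firmware versions that need it; that part is left out because the thread does not describe it.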
It was simpler than that. I did not pass "fan_control=1" option. (Actually... I do not think that option is needed. If fan control is known to work, it should be just enabled...) Pavel --
I require explicit user permission to activate knobs that are that dangerous, and actively frowned upon by the manufacturer. -- Henrique Holschuh --
Well, it is not more dangerous than 2.6.26.2 (will overheat your thinkpad, causing critical shutdown). Plus, acpi fans are controllable/overridable by default from /proc, and 'echo "level 7" > fan' is not something you can do accidentally... Pavel --
It is easily reproduced, but it takes 10+ minutes, and at the end machine is so hot it will not even power up. So yes, bisect is possible, but I'd prefer to avoid it. Pavel --
that's weird. ACPICA should be 20080609 in 2.6.26. Pavel, can you please make a double check? :) thanks, rui --
This one is quite repeatable:
Aug 7 10:46:24 amd dhclient: DHCPACK from 10.20.0.2
Aug 7 10:46:24 amd dhclient: bound to 10.20.5.28 -- renewal in 7200 seconds.
Aug 7 10:50:46 amd kernel: thinkpad_acpi: unhandled HKEY event 0x6022
Aug 7 10:51:03 amd last message repeated 48 times
Aug 7 10:51:05 amd kernel: ACPI: Critical trip point
Aug 7 10:51:05 amd kernel: Critical temperature reached (128 C), shutting down.
Aug 7 10:51:05 amd shutdown[1928]: shutting down for system halt
Aug 7 10:51:06 amd init: Switching to runlevel: 0
Aug 7 10:51:06 amd kernel: thinkpad_acpi: unhandled HKEY event 0x6022
Aug 7 10:51:09 amd last message repeated 7 times
Aug 7 10:51:09 amd kernel: ACPI: Critical trip point
Aug 7 10:51:09 amd kernel: Critical temperature reached (128 C), shutting down.
Aug 7 10:51:12 amd exiting on signal 15
Aug 7 10:54:01 amd syslogd 1.5.0#1: restart.
Aug 7 10:54:01 amd kernel: klogd 1.5.0#1, log source = /proc/kmsg started.
...and it does not seem to be stray reading from the sensor: cat /proc/acpi/therm*/*/* shows the bogus value in like 5 consecutive readings. Plus the temperature rises up to 95C before this triggers, and machine is so hot it refuses to start again. Trip points seem to assume 128C, too:
root@amd:~# cat /proc/acpi/thermal_zone/THM*/trip*
critical (S5): 127 C
critical (S5): 97 C
passive: 93 C: tc1=5 tc2=4 tsp=600 devices=CPU0 CPU1
root@amd:~#
Pavel
--
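
As a rough illustration of how the trip points listed in the message above relate to the 128 C reading, the sketch below classifies a temperature against one zone's trip points. The struct, enum and function names are hypothetical, the split of the three printed trip points into two zones is inferred from the output order, and the `>=` comparison is an assumption about how the thermal driver compares values.

```c
enum trip_action { TRIP_NONE, TRIP_PASSIVE, TRIP_CRITICAL };

struct thermal_zone_trips {
    int critical_c;   /* critical (S5) trip point in degrees C */
    int passive_c;    /* passive trip point, or -1 if the zone has none */
};

/* Picks the strongest action whose trip point the temperature has reached. */
enum trip_action trip_action_for(const struct thermal_zone_trips *z, int temp_c)
{
    if (temp_c >= z->critical_c)
        return TRIP_CRITICAL;
    if (z->passive_c >= 0 && temp_c >= z->passive_c)
        return TRIP_PASSIVE;
    return TRIP_NONE;
}
```

With a zone that has only a 127 C critical trip point, a reading of 128 goes straight to a critical shutdown with no passive stage in between, which is consistent with the zone jumping from 95 to 128 C as described elsewhere in the thread. The other zone (97 C critical, 93 C passive) would request passive cooling at 95 C.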
while true; do echo -n; done &
while true; do echo -n; done &
is enough to trigger this. According to /proc/acpi/ibm, fan is running too slowly...?
Pavel
--
