Re: 2.6.26-rc1 fails to boot

To: <linux-kernel@...>
Date: Saturday, May 10, 2008 - 4:21 pm

Hello,

Kernel 2.6.26-rc1 completely fails to boot on my laptop. The last line
appearing on the console is:

ACPI: Processor [CPU1] (supports 8 throttling states)

Kernel 2.6.25 fails to boot in an estimated 1 in 2 attempts. The last
line on the console if this one fails is:

ACPI: LNXTHERM:01 is registered as thermal_zone0

I did a git bisection between 2.6.26-rc1 and 2.6.25-rc3 (which I
believed to be free of this problem), using "fails to boot in the
first three attempts" as the definition of "bad". The many reboots in
the bisection process showed me, however, that 2.6.25-rc3 also has an
estimated 1 in 10 chance of failing. Anyway, the bisection ended
with:

8a3227268877b81096d7b7a841aaf51099ad2068 is first bad commit

The bisection log and dmesg are attached.
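
A rough way to see how an intermittent hang can mislead a bisection is to
simulate it. The sketch below is only an illustration under stated
assumptions: commits are numbered 0..n-1, every commit from a hypothetical
first_bad onward hangs with a fixed, independent probability, and earlier
commits never hang (which is not quite the situation described above, since
2.6.25-rc3 also failed occasionally). The function names are made up for this
example and have nothing to do with git itself.

```python
import random


def is_marked_bad(commit, first_bad, hang_probability, attempts, rng):
    """Apply the 'fails to boot within the first N attempts' rule."""
    if commit < first_bad:
        return False  # assumption: commits before first_bad never hang
    # A truly bad commit is only *marked* bad if at least one boot hangs.
    return any(rng.random() < hang_probability for _ in range(attempts))


def bisect(n_commits, first_bad, hang_probability, attempts, rng):
    """Binary search between a known-good commit 0 and a known-bad tip."""
    good, bad = 0, n_commits - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_marked_bad(mid, first_bad, hang_probability, attempts, rng):
            bad = mid
        else:
            good = mid  # may be wrong: a bad commit that happened to boot
    return bad


if __name__ == "__main__":
    rng = random.Random(2008)
    runs = 1000
    # Hypothetical numbers: 1000 commits, culprit at 400, 1-in-10 hang rate.
    hits = sum(bisect(1000, 400, 0.1, 3, rng) == 400 for _ in range(runs))
    print(f"correct commit found in {hits} of {runs} simulated bisections")
```

Under these assumptions a wrong verdict can only push the result toward later
commits, because a bad commit can be mistaken for a good one but not the
other way round.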

To: Henny Wilbrink <wsdwhw@...>
Cc: <linux-kernel@...>, <venkatesh.pallipadi@...>
Date: Saturday, May 10, 2008 - 6:22 pm

On Sat 10.May'08 at 22:21:28 +0200, Henny Wilbrink wrote:

I have this boot problem since before 2.6.25-rc1 but it is
very difficult to do a bisection. Sometimes the probability
of hanging was 1:30 and what I thought was a good kernel in
fact was bad.

In fact the commit you got is a "merge commit" which does not
change the source code so it is clearly bogus.

Mark Lord points out that this bug comes and goes with slight
modifications in the .config, and in fact very recently
the probability of hanging changed dramatically for me.

I have already booted the kernel afa26be86b65 (six commits
after 2.6.26-rc1) more than 30 times and it did not hang.
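
To put rough numbers on why a run of clean boots proves little, here is a
small calculation. It assumes that each boot hangs independently with a fixed
probability, which is a simplification and not something established in this
thread; the function names are illustrative only.

```python
import math


def prob_all_boots_succeed(hang_probability: float, boots: int) -> float:
    """Chance that `boots` independent boots of a bad kernel all succeed."""
    return (1.0 - hang_probability) ** boots


def boots_needed(hang_probability: float, max_false_good: float) -> int:
    """Smallest number of clean boots after which a bad kernel would have
    slipped through with probability at most `max_false_good`."""
    return math.ceil(math.log(max_false_good) / math.log(1.0 - hang_probability))


if __name__ == "__main__":
    # Three attempts against a 1-in-10 hang rate.
    print(round(prob_all_boots_succeed(0.1, 3), 3))
    # Thirty boots against a 1-in-30 hang rate.
    print(round(prob_all_boots_succeed(1 / 30, 30), 3))
    # Boots needed to get the false-"good" chance down to 5%.
    print(boots_needed(0.1, 0.05), boots_needed(1 / 30, 0.05))
```

With a 1-in-10 hang rate, three clean boots still leave about a 73% chance
that a bad kernel is called good, and with a 1-in-30 rate even 30 clean boots
leave about 36%.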

You can also take a look at some discussion here:
http://bugzilla.kernel.org/show_bug.cgi?id=10117

And good luck with this!
--

To: Carlos R. Mafra <crmafra2@...>
Cc: <linux-kernel@...>, <venkatesh.pallipadi@...>
Date: Sunday, May 11, 2008 - 5:35 am

No, 2.6.26-rc1 with hpet=disable now stops at

ata_piix 0000:00:1f.2: MAP [ P0 P2 IDE IDE ]

However, with idle=mwait as suggested by Venkatesh Pallipadi it did boot.
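
When repeating tests like these it is easy to lose track of which parameters
a given boot actually received. The following is a minimal sketch of a helper
that splits a kernel command line into name/value pairs; the sample string is
hypothetical, and quoted values containing spaces are deliberately not
handled.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}.

    Bare flags such as "ro" map to None. Quoted values with embedded
    spaces are not supported in this sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


if __name__ == "__main__":
    # Hypothetical command line; on a running system the real one can be
    # read from /proc/cmdline.
    sample = "root=/dev/sda1 ro hpet=disable idle=mwait"
    params = parse_cmdline(sample)
    for name in ("hpet", "idle", "nohz"):
        print(name, "=", params.get(name, "(not set)"))
```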

Thanks,

Henny
--

To: Henny Wilbrink <wsdwhw@...>
Cc: <linux-kernel@...>, <venkatesh.pallipadi@...>, <tglx@...>, <liml@...>
Date: Sunday, May 11, 2008 - 3:55 pm

I think so too, but it is difficult to find out where the
problem really is (hpet? nohz? cpuidle? etc.)

So today I tried to humbly hack a bit to get a trace, and I think
I managed to do it.

According to my notes, with 2.6.25-rc9 the kernel printed
two lines when I pressed the power button after the boot
had hung:

evmisc-0145 [00] ev_queue_notify_reques: Dispatching Notify(80) on node ffff81007f04ee30
evmisc-0154 [00] ev_queue_notify_reques: Notify value: 0x80 **Device Specific*

so the kernel was not completely hung. Sometimes lines similar to the above (I
don't remember the numbers anymore) were printed continuously for more than
15 minutes after the hang, at a rate of roughly one per minute.

So my idea was to insert a WARN_ON(1) in a few places, and in particular the
function which had those evmisc printk's.

Then I got some traces before and after the hang point, and another trace when
I pressed the power button! I don't know if they will reveal something
interesting to kernel hackers, but I will transcribe them here (I lost my
digital camera, so it took me a long time to write them).

This is what the screen looked like when the boot hung (caveat lector: it
was all copied by hand, so there may be typos):

[<ffffffff8022a909>] ? update_rq_clock+0x19/0x20
[<ffffffff8023fbfa>] run_timer_softirq+0x2a/0x230
[<ffffffff8022a81a>] ? __update_rq_clock+0x2a/0x100
[<ffffffff8023bc44>] __do_softirq+0x74/0xf0
[<ffffffff8020fcca>] ? profile_pc+0x3a/0x70
[<ffffffff8020d45c>] call_softirq+0x1c/0x30
[<ffffffff8020faed>] do_softirq+0x3d/0x80
[<ffffffff8023bbc5>] irq_exit+0x85/0x90
[<ffffffff8021f4de>] smp_apic_timer_interrupt+0x7e/0xc0
[<ffffffff8020b260>] ? mwait_idle+0x0/0x50
[<ffffffff8020b0e0>] ? default_idle+0x0/0x70
[<ffffffff8020cf06>] apic_timer_interrupt+0x66/0x70
<EOI> [<ffffffff8020b2a0>] ? mwait_idle+0x40/0x50
[<ffffffff8020a882>] ? enter_idle+0x22/0x30
[<ffff...
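
Since the trace above was transcribed by hand, a small parser can help check
that each line is well formed. This is a sketch only: the regular expression
is an assumption based on the lines shown here (address in "[<...>]", an
optional "?" marking an entry the unwinder considers unreliable, then
symbol+offset/size), and the Frame type and parse_frame name are invented for
the example.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the trace lines above, not from kernel sources.
TRACE_LINE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


class Frame(NamedTuple):
    address: int
    symbol: str
    offset: int
    size: int
    reliable: bool


def parse_frame(line: str) -> Optional[Frame]:
    """Return the parsed frame, or None if the line does not match."""
    m = TRACE_LINE.search(line)
    if m is None:
        return None
    return Frame(
        address=int(m["addr"], 16),
        symbol=m["symbol"],
        offset=int(m["offset"], 16),
        size=int(m["size"], 16),
        reliable=m["unreliable"] is None,
    )


if __name__ == "__main__":
    lines = [
        "[<ffffffff8022a909>] ? update_rq_clock+0x19/0x20",
        "[<ffffffff8023fbfa>] run_timer_softirq+0x2a/0x230",
        "<EOI> [<ffffffff8020b2a0>] ? mwait_idle+0x40/0x50",
        "[<ffff...",
    ]
    for line in lines:
        frame = parse_frame(line)
        if frame is None:
            print("unparsed:", line)
        elif frame.offset >= frame.size:
            print("suspicious (offset beyond function size):", line)
        else:
            print(frame.symbol, "reliable" if frame.reliable else "unreliable")
```

An offset that is not smaller than the function size would be one hint of a
transcription error.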

To: Carlos R. Mafra <crmafra2@...>
Cc: <linux-kernel@...>, <venkatesh.pallipadi@...>, <tglx@...>, <liml@...>
Date: Monday, May 12, 2008 - 7:58 am

Well, for what it is worth, I just finished building 2.6.26-rc2 and it
booted ok three times in a row.

Regards,
Henny
--
