Re: 2.6.25 slow boot/reboot

Previous thread: [PATCH] result of csum_fold() is already 16bit, no need to cast by Al Viro on Sunday, April 27, 2008 - 1:27 am. (2 messages)

Next thread: [PATCH] tipc endianness annotations by Al Viro on Sunday, April 27, 2008 - 1:40 am. (3 messages)
To: LKML <linux-kernel@...>
Date: Sunday, April 27, 2008 - 1:37 am

Hello,
This weekend I got some time and decided to try out 2.6.25, but its boot process was _really_ slow on my laptop[1]. With 'old' 2.6.24.5 my machine would take about 48 secs until it gave me the login prompt. And it would take about 22 seconds to reboot.

With the 2.6.25 I got from Linus' git tree, my system takes about 4 min and 40 secs to boot. Sometimes it looks like the system just freezes for a while, and then it continues the boot process. Rebooting also seems problematic here. Almost 3 minutes after I had asked the system to reboot, it was still there. I typed reboot again and it then rebooted right away, weirdly.

The dmesg and config for both kernels are attached.
Thanks,
-sergio

To: Sergio Luis <sergio@...>
Cc: LKML <linux-kernel@...>
Date: Sunday, April 27, 2008 - 11:23 am

Do you use LILO or GRUB for booting ? 2.6.25 works OK on the systems I
tested, but LILO really needs a lot of time to load the 2.6.25 kernel.
GRUB loads the 2.6.25 kernel at normal speed.

Bart.
--

To: Bart Van Assche <bart.vanassche@...>
Cc: Sergio Luis <sergio@...>, LKML <linux-kernel@...>
Date: Sunday, April 27, 2008 - 7:55 pm

ISTR that an _old_ version of lilo was mentioned earlier in this
thread. As a datapoint, on my one desktop box which uses lilo (an
athlon64 uniprocessor) both 32 and 64-bit 2.6.25 kernels boot fine.
Both of those systems are with lilo-22.8, and gcc-4.2.2.

But, I think you (Bart) haven't said which version of lilo you are
using ? If it isn't recent, perhaps upgrading it might help ?

For Sergio, you have my sympathy. I totally failed to bisect my own
problem with 2.6.25-rc (and 2.6.24.1), although I did find the problem
by other means, and got a work-around, so I'm not competent to
diagnose what is wrong, but maybe I can help to tease out what is
different about your box. As a start, you could try diffing your
config's for 2.6.24.5 and 2.6.25 in case something odd has changed.
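
The config comparison suggested here can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and the two fragments are made up, not taken from the configs attached to this thread. It treats "# CONFIG_FOO is not set" as the value "n" so that an option being switched on shows up as a change.

```python
import re

# Matches "CONFIG_FOO=value" and "# CONFIG_FOO is not set" lines.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option: value}; options marked 'is not set' map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return sorted (option, old_value, new_value) tuples for each difference.

    A value of None means the option does not appear in that file at all.
    """
    old, new = parse_config(old_text), parse_config(new_text)
    changes = []
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes.append((name, old.get(name), new.get(name)))
    return changes


# Made-up fragments, for illustration only.
old_cfg = """CONFIG_X86_32=y
CONFIG_SMP=y
# CONFIG_NO_HZ is not set
CONFIG_HZ=250
"""
new_cfg = """CONFIG_X86_32=y
CONFIG_SMP=y
CONFIG_NO_HZ=y
CONFIG_HZ=1000
CONFIG_HOTPLUG_CPU=y
"""

for name, before, after in diff_configs(old_cfg, new_cfg):
    print(f"{name}: {before} -> {after}")
```

With real files, the two strings would simply be read from the 2.6.24.5 and 2.6.25 config files.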

Or, perhaps this is a problem specific to a certain processor ? So
far, I think all the list knows is that you have a problem, and an
old version of lilo. More data might eventually help to identify
what is causing this. If you have a fairly old version of lilo,
maybe you also have an old version of gcc ?

For Bart too, which version(s) of gcc are you using on the systems
where lilo is slow to load, and which cpu(s) do you have there ?

Ken, who relies on lilo for his server, and gets worried by reports
of trouble with it.
--
das eine Mal als Tragödie, das andere Mal als Farce
--

To: Ken Moffat <zarniwhoop@...>
Cc: Sergio Luis <sergio@...>, LKML <linux-kernel@...>
Date: Monday, April 28, 2008 - 2:10 am

I noticed the slow loading behavior of LILO after a fresh install of
Ubuntu 8.04 beta with XFS as root file system. I did not check the
version number. Should I look it up ?

Bart.
--

To: Bart Van Assche <bart.vanassche@...>
Cc: Sergio Luis <sergio@...>, LKML <linux-kernel@...>
Date: Monday, April 28, 2008 - 9:27 am

'/sbin/lilo -V' but a look at their pool implies it will either be
22.8 or 22.6. A quick look at the changelog in their diff for 22.8
shows they are using device-mapper, dunno if that has any relevance to
this problem.

Ken
--
das eine Mal als Tragödie, das andere Mal als Farce
--

To: Ken Moffat <zarniwhoop@...>
Cc: Bart Van Assche <bart.vanassche@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Glauber Costa <gcosta@...>
Date: Sunday, April 27, 2008 - 9:24 pm

I tried bisecting and after some hours I got
9713277607f9eac7d655c6854dd92bc2ce1b6f02 as first bad commit

commit 9713277607f9eac7d655c6854dd92bc2ce1b6f02
Author: Glauber de Oliveira Costa <gcosta@redhat.com>
Date: Wed Mar 19 14:25:43 2008 -0300

x86: boot cpus from cpu_up, instead of prepare_cpus

After all the infrastructure work, we're now prepared
to boot the cpus from cpu_up, and not from prepare_cpus.
So the difference between cold boot and hotplug is effectively
over, and the functions are used to the purposes they're meant to.

Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

lilo version is 22.8
gcc version is 4.1.2
the processor is an amd turion 64x2 2.0 ghz (tl-60) and I am building a 32bit kernel.

note lilo is indeed much slower than grub to start booting the kernel here, but I am
talking about this 2.6.25 kernel taking almost 5 min to finish the boot process (once it is
actually started by the bootloader) when it would take less than 1 minute with 2.6.24.5 in
this same machine.

thanks,

-sergio
--

To: Sergio Luis <sergio@...>
Cc: Ken Moffat <zarniwhoop@...>, Bart Van Assche <bart.vanassche@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Monday, April 28, 2008 - 10:19 am

Can you give me more information on that?
your .config and cpuinfo would be a great start.

I'm especially interested in things involving APIC.
--

To: Glauber Costa <gcosta@...>
Cc: Ken Moffat <zarniwhoop@...>, Bart Van Assche <bart.vanassche@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Monday, April 28, 2008 - 10:35 am

Hello,
I sent the config's (2.6.24.5 and 2.6.25) attached in the first
message of this thread http://lkml.org/lkml/2008/4/27/19
about the cpuinfo, I don't have that computer here right now, as I am
at work, so I can't send it at the moment. It's a TL-60
AMD Turion 64x2 (2.0 GHZ), though, if it helps, and as I said, I am
building a 32 bit kernel with gcc 4.1.2.
Let me know if you need additional info and I will be sending later on
when I am with that laptop again.
thanks,
-sergio
--

To: Sergio Luis <sergio@...>
Cc: Ken Moffat <zarniwhoop@...>, Bart Van Assche <bart.vanassche@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Monday, April 28, 2008 - 8:32 pm

Sergio,

You said your system freezes. Does it happen after the last message you
see on dmesg, or during the kernel start up? It would help me to rule
out (or not), any issues in the cpu initialization process itself.

As for reboot, any suspicious message on your kernel log?
--

To: Glauber Costa <gcosta@...>
Cc: Ken Moffat <zarniwhoop@...>, Bart Van Assche <bart.vanassche@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Tuesday, April 29, 2008 - 2:18 am

Hello Glauber,

the "freeze" happens during the kernel start up. It proceeds with delays at several points.
as I mentioned, when it prints "Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled"
it takes 90 seconds to print the next message. And it had already taken 20 seconds to reach that "serial" message.
And it keeps on freezing at some other points too, after that. I also noticed the mouse cursor in the bottom left
is blinking at a very low frequency, about every 6 seconds. It really disappears for a while, then returns and stays
for another while and so on.

by the way, I messed up my git tree and I was wondering how I could get it back to the state it was in when I started
testing this. this commit from Linus was the latest one at that point:

----

commit c3bf9bc243092c53946fd6d8ebd6dc2f4e572d48
Merge: e3505dd... c2b91e2...
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat Apr 26 14:04:32 2008 -0700

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem-v3

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem-v3:
x86_64/mm: check and print vmemmap allocation continuous
x86_64: fix setup_node_bootmem to support big mem excluding with memmap
x86_64: make reserve_bootmem_generic() use new reserve_bootmem()
mm: allow reserve_bootmem() cross nodes
mm: offset align in alloc_bootmem()
mm: fix alloc_bootmem_core to use fast searching for all nodes
mm: make mem_map allocation continuous

----

right now I am not able to reproduce the _exact_ problem I had when I created this thread, i.e. I can't boot anymore. I tested 2.6.25 and it didn't have any
problems, it booted normally, like my 2.6.24.5 did. Then I tried 2.6.25-git1 and it gave me the delay problem I wrote about in this thread, but then the system panics with
"Unable to mount root fs on unknown-block(8,5)". I tested with 2.6.25-git11 and 2.6....

To: Sergio Luis <sergio@...>
Cc: Ken Moffat <zarniwhoop@...>, Bart Van Assche <bart.vanassche@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Tuesday, April 29, 2008 - 8:26 am

This does not correspond, by far, to what is seen in the dmesg you posted.

[ 1.375711] Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
[ 4.399907] floppy0: no floppy controllers found

a 90-second delay is not what's happening, or at least, not what the
kernel is seeing. So my bet would
be something clock-related. Probably the system's clocksource is not
running the time correctly, which is causing system events to be
delayed. I fail to see, however, how does the patch you bisected to

So maybe the commit you bisected to is not really guilty. If you can
now boot 2.6.25 fine, it might be the case that the commits you marked

--
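
The comparison made in the message above (about 3 seconds between the two lines by the kernel's own timestamps, against roughly 90 seconds observed on the wall clock) can be automated for a whole log. The following is a minimal sketch, assuming the dmesg output carries printk timestamps in the "[ seconds.micros]" form quoted above; the function names are hypothetical.

```python
import re

# dmesg lines with printk timestamps look like "[    1.375711] message".
_STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def parse_dmesg(text):
    """Return a list of (timestamp_seconds, message) for timestamped lines."""
    entries = []
    for line in text.splitlines():
        m = _STAMP.match(line)
        if m:
            entries.append((float(m.group(1)), m.group(2)))
    return entries


def largest_gaps(text, top=3):
    """Return the `top` largest gaps as (gap_seconds, message_before, message_after)."""
    entries = parse_dmesg(text)
    gaps = []
    for (t0, msg0), (t1, msg1) in zip(entries, entries[1:]):
        gaps.append((round(t1 - t0, 6), msg0, msg1))
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]


# The two lines quoted in this thread.
sample = """[ 1.375711] Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
[ 4.399907] floppy0: no floppy controllers found
"""

for gap, before, after in largest_gaps(sample):
    print(f"{gap:.6f}s between '{before[:30]}...' and '{after[:30]}...'")
```

If the gaps reported this way stay small while the boot visibly stalls, the kernel's notion of time is not advancing at the real rate, which is consistent with the clock-related suspicion above.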

To: Glauber Costa <gcosta@...>
Cc: Sergio Luis <sergio@...>, Ken Moffat <zarniwhoop@...>, Bart Van Assche <bart.vanassche@...>, LKML <linux-kernel@...>
Date: Tuesday, April 29, 2008 - 10:28 am

the first thing to check, does latest x86.git work fine:

http://people.redhat.com/mingo/x86.git/README

? We've got fixes queued up - in particular one could result in 'slow'
systems by virtue of denying an ioremap():

Subject: revert: "x86: ioremap(), extend check to all RAM pages"

maybe the bisection went haywire.

Or the secondary core booted up in such a sucky way that it causes such
massive slowdowns? Perhaps we are flooding the system with local APIC
timer interrupts or other interrupts?
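
One way to check for an interrupt flood is to take two snapshots of /proc/interrupts a known number of seconds apart and compare the counters. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the two snapshots are made-up numbers in the usual /proc/interrupts layout, not data from the affected laptop.

```python
def parse_interrupts(text):
    """Return {irq_label: total_count_across_cpus} from /proc/interrupts text."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        per_cpu = []
        for field in rest.split()[:ncpu]:
            if not field.isdigit():
                break  # reached the description columns
            per_cpu.append(int(field))
        if per_cpu:
            counts[label.strip()] = sum(per_cpu)
    return counts


def interrupt_rates(before, after, seconds):
    """Return [(irq_label, interrupts_per_second)] sorted by rate, highest first."""
    a, b = parse_interrupts(before), parse_interrupts(after)
    rates = {irq: (b[irq] - a[irq]) / seconds for irq in b if irq in a}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Made-up snapshots taken 10 seconds apart, for illustration only.
snap_before = """           CPU0       CPU1
  0:       1000          0   IO-APIC-edge      timer
LOC:       5000       4000   Local timer interrupts
ERR:          0
"""
snap_after = """           CPU0       CPU1
  0:       1010          0   IO-APIC-edge      timer
LOC:       7500     904000   Local timer interrupts
ERR:          0
"""

for irq, rate in interrupt_rates(snap_before, snap_after, seconds=10):
    print(f"{irq:>4}: {rate:.1f}/s")
```

A source whose rate is orders of magnitude above the others, as LOC is in this invented sample, would be the one to look at first.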

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Glauber Costa <gcosta@...>, Ken Moffat <zarniwhoop@...>, Bart Van Assche <bart.vanassche@...>, LKML <linux-kernel@...>
Date: Tuesday, April 29, 2008 - 9:05 pm

nope, it doesn't work fine, but gives me the same issue (I had to apply
the scsi patch in http://lkml.org/lkml/2008/4/27/309 in order to make

I tried re-bisecting using now 2.6.25 as my initial good kernel, but in
the 2nd or 3rd iteration I start having compilation problems such as

--
ld:fs/afs/cell.o: file format not recognized; treating as linker script
ld:fs/afs/cell.o:1: syntax error

The same problem happens with fs/autofs/inode.o and crypto/hmac.o
--

and it stopped the bisection process from going forward.
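
A revision that cannot be built does not have to end a bisection. When bisecting by hand, `git bisect skip` tells git to pick a nearby commit instead; when a test script is driven by `git bisect run`, exit code 125 has the same meaning. The sketch below shows that exit-code convention only. The function names and the 120-second limit are hypothetical (chosen to sit between the 48-second and 280-second boots reported in this thread), and measuring the boot time itself is left out, since that needs a reboot of the test machine.

```python
import subprocess

# Exit codes understood by `git bisect run`:
#   0 = good, 1..124 and 126..127 = bad, 125 = cannot be tested (skip).
GOOD, BAD, SKIP = 0, 1, 125


def classify(build_ok, boot_seconds, limit_seconds=120.0):
    """Map one test outcome to a `git bisect run` exit code.

    limit_seconds is an assumed threshold separating a normal boot
    from a slow one; adjust it for the machine under test.
    """
    if not build_ok:
        return SKIP  # e.g. the linker errors: untestable, not "bad"
    return BAD if boot_seconds > limit_seconds else GOOD


def build_kernel(jobs=2):
    """Run `make` in the current kernel tree; True if it succeeded."""
    return subprocess.run(["make", f"-j{jobs}"]).returncode == 0


# Outcomes corresponding to the situations described in this thread.
print(classify(build_ok=False, boot_seconds=0))    # build failure
print(classify(build_ok=True, boot_seconds=48))    # 2.6.24.5-like boot
print(classify(build_ok=True, boot_seconds=280))   # slow boot
```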

I am using binutils 2.17.50.0.17 (slackware 12 here) and also tried the
latest binutils cvs, but it gives me the same error there.

Any suggestions?
thanks,

--

To: Bart Van Assche <bart.vanassche@...>
Cc: LKML <linux-kernel@...>
Date: Sunday, April 27, 2008 - 12:33 pm

I use LILO, but read your answer and tested GRUB, and you are right, GRUB starts the boot process much faster than LILO, but I was talking about the time it takes once the kernel is booting already. Isn't that independent from the bootloader used?

I experience the same problems when I use GRUB.
For instance, I see
"Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled"
and the system seems to freeze for a while. It takes about 90 seconds for the next message (floppy0: no floppy controllers found) to show up.

-sergio
--
