Re: today's linux-next fails to boot

From: Török Edwin
Date: Friday, July 11, 2008 - 4:12 am

Hi,

Today's linux-next tree (commit
93847083e4791567931bd17c039cc35881cdad29) fails to boot:
[built with gcc-4.2.4-3]

BUG: Int 14: CR2 b0049dea
     EDI 00000082 ESI 00000000 EBP c059be88 ESP c059be5c
     EBX f000ec62 EDX 0000000e ECX c0595480 EAX f000ec62
     err 00000000 EIP c0181ca0  CS 00000060 flg 00010082
Stack:   00000040 c06a2ba0 000080d0 c0595480 c0000f19c c000f180 c0581120
c059bea8
         c02bf19b 00000000 00000080 c059beb8 c0000f194 c000f180 0000000a
c059beb8
         c03a1059 00000000 00000000 c059bed8 c05c4c7c  0009efff 00000000
c04f4df4

I get this as soon as I boot from grub2. Strangely, the error message is
at the bottom of the screen, and I can't see the full message (scrolling
doesn't work).

The last kernel I built and booted was 2.6.26-rc8 from Linus's tree. I
will try to build and boot 2.6.26-rc9, and then bisect.

This happens on a 32-bit Dell Inspiron 6400 (Intel Core Duo T2300 @ 1.66
GHz CPU), Intel ICH-7 chipset, and a Seagate SATA drive.
I will provide full hardware details once I have bisected the problem.

Meanwhile, does anybody have an idea as to what is wrong?

Best regards,
--Edwin
--
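
For readers decoding the dump above: "Int 14" is the x86 page-fault vector, CR2 holds the faulting address, and "err" is the page-fault error code. The following is a minimal sketch of decoding the low three bits of that code; the function name `decode_page_fault_error` is hypothetical and only the architectural bit meanings are assumed.

```python
def decode_page_fault_error(err: int) -> dict:
    """Decode the low three bits of an x86 page-fault error code.

    Hypothetical helper for illustration. Higher bits (reserved-bit
    violation, instruction fetch) are deliberately ignored here.
    """
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if err & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if err & 0x2 else "read",
        # bit 2: 0 = fault raised in kernel mode, 1 = in user mode
        "mode": "user" if err & 0x4 else "kernel",
    }


# The "err" line exactly as it appears in the report above.
report_line = "err 00000000 EIP c0181ca0  CS 00000060 flg 00010082"
fields = report_line.split()
err = int(fields[fields.index("err") + 1], 16)
print(decode_page_fault_error(err))
```

Under these assumptions the reported error code corresponds to a kernel-mode read of a page that is not present, at the address given in CR2.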

From: Takashi Iwai
Date: Saturday, July 12, 2008 - 8:03 am

At Fri, 11 Jul 2008 14:12:11 +0300,

[Added Ingo to Cc]

I get the boot problem on i386 with the 2008-07-11 linux-next tree, too.
In my case, no error appears on the screen; it just stays blank and
dead.  It seems to stop at the very beginning, soon after GRUB, so
it could have the same cause.

The same config worked fine with yesterday's tree (2008-07-10) on the
same machine.
Also, today's tree works on x86-64 (but on another machine).

My config is below.


thanks,

Takashi

---
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.26-rc9
# Tue Jul  8 11:37:08 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
# CONFIG_HAVE_CPUMASK_OF_CPU_MAP is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not ...
From: Török Edwin
Date: Friday, July 11, 2008 - 6:13 am

I don't see any boot messages on the screen; I get that BUG message as
soon as grub's menu disappears.
I have bisected it to this range so far:
git-bisect good aa03060a78c1aec53075a0c8ca7be19cedfbea8f
git-bisect bad b1611c0058bc6635e7257e755c3f194933a7a6df

Should I continue to bisect?

See git-bisect log, and .config below.

git-bisect start
# good: [e5a5816f7875207cb0a0a7032e39a4686c5e10a4] Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
git-bisect good e5a5816f7875207cb0a0a7032e39a4686c5e10a4
# bad: [93847083e4791567931bd17c039cc35881cdad29] Add linux-next
specific files for 20080711
git-bisect bad 93847083e4791567931bd17c039cc35881cdad29
# bad: [5aeabb501abf4fa99a85c9dd347f2c3399545f01] Merge commit 'kvm/master'
git-bisect bad 5aeabb501abf4fa99a85c9dd347f2c3399545f01
# bad: [da1671f1e4d1c7baf81e938e1c20b99fa6a79982] Revert "kconfig:
normalize int/hex values"
git-bisect bad da1671f1e4d1c7baf81e938e1c20b99fa6a79982
# good: [9e97638c0ab1588913e298b41fca68a593650058] Merge branch
'x86/core' into auto-x86-next
git-bisect good 9e97638c0ab1588913e298b41fca68a593650058
# good: [aa03060a78c1aec53075a0c8ca7be19cedfbea8f] Merge commit
'safe-poison-pointers/auto-safe-poison-pointers-next'
git-bisect good aa03060a78c1aec53075a0c8ca7be19cedfbea8f
# bad: [10889486f1de748096e999ee6b9d22890504cebf] Merge branch 'quilt/i2c'
git-bisect bad 10889486f1de748096e999ee6b9d22890504cebf
# bad: [b1611c0058bc6635e7257e755c3f194933a7a6df] Merge commit
'x86/auto-x86-next'
git-bisect bad b1611c0058bc6635e7257e755c3f194933a7a6df
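
A bisect log like the one above can be summarised mechanically. The following is a rough sketch that reports the most recent "good" and "bad" verdicts; the function name `bisect_range` is hypothetical, and treating the latest verdicts as the remaining range is a simplification that may not hold exactly in merge-heavy history such as linux-next.

```python
import re

# Matches both the old "git-bisect good <sha>" and the newer
# "git bisect good <sha>" spellings found in bisect logs.
VERDICT = re.compile(r"^git[- ]bisect (good|bad) ([0-9a-f]{40})\s*$")


def bisect_range(log_text: str):
    """Return (last_good, last_bad) commit IDs from a bisect log.

    Hypothetical helper: it simply keeps the most recent verdict of
    each kind and skips comment lines and "git-bisect start".
    """
    last = {"good": None, "bad": None}
    for line in log_text.splitlines():
        match = VERDICT.match(line.strip())
        if match:
            last[match.group(1)] = match.group(2)
    return last["good"], last["bad"]


# Excerpt of the log from this message (comment lines omitted).
SAMPLE_LOG = """\
git-bisect start
git-bisect good 9e97638c0ab1588913e298b41fca68a593650058
git-bisect good aa03060a78c1aec53075a0c8ca7be19cedfbea8f
git-bisect bad 10889486f1de748096e999ee6b9d22890504cebf
git-bisect bad b1611c0058bc6635e7257e755c3f194933a7a6df
"""

good, bad = bisect_range(SAMPLE_LOG)
print("good:", good)
print("bad: ", bad)
```

For this excerpt the result matches the good/bad pair quoted at the top of the message.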

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.26-rc9
# Fri Jul 11 13:08:25 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
# CONFIG_GENERIC_LOCKBREAK is not ...
From: Ingo Molnar
Date: Friday, July 11, 2008 - 6:59 am

could you check latest tip/master, does it boot fine with the same 
config?

	Ingo
--

From: Török Edwin
Date: Friday, July 11, 2008 - 7:48 am

tip/master boots fine.


Thanks for the hint, I rebuilt a failing kernel, and this is what
addr2line says:

$ addr2line -e vmlinux -i c0181ca0

??:0
$ addr2line -e vmlinux -f c0181ca0
kmem_cache_alloc
??:0

Best regards,
--Edwin


--
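
The lookup shown in this message can be scripted so that the EIP is taken straight from the crash text rather than retyped. This is a minimal sketch, assuming a 32-bit dump with an "EIP xxxxxxxx" field as in the original report; the function name `addr2line_command` and the default `vmlinux` path are hypothetical. Only the standard addr2line options -e, -f and -i are used.

```python
import re


def addr2line_command(oops_text: str, vmlinux: str = "vmlinux") -> list:
    """Build an addr2line argument list for the EIP found in oops_text.

    Hypothetical helper. The vmlinux passed here has to be the image
    that was actually booted, otherwise the lookup is meaningless.
    """
    match = re.search(r"\bEIP\s+([0-9a-fA-F]{8})\b", oops_text)
    if match is None:
        raise ValueError("no EIP field found in crash text")
    # -f prints the function name, -i unwinds inlined call sites.
    return ["addr2line", "-e", vmlinux, "-f", "-i", match.group(1)]


# The register line from the original report in this thread.
oops = "err 00000000 EIP c0181ca0  CS 00000060 flg 00010082"
print(" ".join(addr2line_command(oops)))
```

The resulting list can be handed to `subprocess.run` unchanged. Without CONFIG_DEBUG_INFO the tool can still resolve the function name but reports `??:0` for the source location.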

From: Ingo Molnar
Date: Friday, July 11, 2008 - 7:53 am

to update linux-next to the latest bits in -tip, you can perhaps do 
something like this:

 git-merge tip/auto-x86-next
 git-merge tip/auto-core-next
 git-merge tip/auto-cpus4096-next
 git-merge tip/auto-ftrace-next
 git-merge tip/auto-generic-ipi-next
 git-merge tip/auto-genirq-next
 git-merge tip/auto-latest
 git-merge tip/auto-safe-poison-pointers-next
 git-merge tip/auto-sched-next
 git-merge tip/auto-stackprotector-next
 git-merge tip/auto-timers-next

but it's easily possible that the bug is in some other portion of 
linux-next.

	Ingo
--

From: Rafael J. Wysocki
Date: Friday, July 11, 2008 - 8:10 am

Hm, I haven't tested today's linux-next myself yet, but I have a related
question.  Namely, is there a way to get a log of commits that have been
added since the previous linux-next?

That may help to find the guilty patch if yesterday's linux-next works.

Thanks,
Rafael
--

From: Ingo Molnar
Date: Friday, July 11, 2008 - 12:07 pm

at least in -tip it works like this:

   git-shortlog tip-history-2008-07-10_09.58_Thu..

or, a more practical format with commit IDs on the same line:

  git log --no-merges --pretty=format:"%h: %s" \
      tip-history-2008-07-10_09.58_Thu..

you can restrict it to a given piece of code as well, say:

  git log --no-merges --pretty=format:"%h: %s" \
      tip-history-2008-07-10_09.58_Thu.. -- arch/x86/ include/asm-x86/

this doesn't work in linux-next nearly as well, due to the Quilt imported 
trees. Every time a quilt queue is updated and reimported, there's a 
stream of repeat commits.

	Ingo
--

From: Rafael J. Wysocki
Date: Friday, July 11, 2008 - 2:06 pm

Thanks for the tips.

Well, it turns out that linux-next from today doesn't boot on my box either
(64-bit) and I don't see anything obviously suspicious.  Bisection time.

Thanks,
Rafael
--

From: Rafael J. Wysocki
Date: Friday, July 11, 2008 - 5:50 pm

Hi Ingo,


I have identified the source of the breakage on my box, but I don't really
think it's the same problem that Edwin is observing.

Namely, it turns out that some code in arch/x86/kernel/acpi/boot.c, as in
today's linux-next, doesn't really make sense, because we have two conflicting
DMA-based quirks in there for the same set of boxes (HP nx6325 and nx6125) and
one of them actually breaks my box.

I have reported that already, but it probably got lost somewhere.

Below is a patch that fixes things for me, on top of today's linux-next.
Please apply.

Thanks,
Rafael

---
Remove some code that breaks my HP nx6325 from arch/x86/kernel/acpi/boot.c.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 arch/x86/kernel/acpi/boot.c |   47 --------------------------------------------
 1 file changed, 47 deletions(-)

Index: linux-next/arch/x86/kernel/acpi/boot.c
===================================================================
--- linux-next.orig/arch/x86/kernel/acpi/boot.c
+++ linux-next/arch/x86/kernel/acpi/boot.c
@@ -84,8 +84,6 @@ int acpi_lapic;
 int acpi_ioapic;
 int acpi_strict;
 
-static int disable_irq0_through_ioapic __initdata;
-
 u8 acpi_sci_flags __initdata;
 int acpi_sci_override_gsi __initdata;
 int acpi_skip_timer_override __initdata;
@@ -982,10 +980,6 @@ void __init mp_override_legacy_irq(u8 bu
 	int pin;
 	struct mp_config_intsrc mp_irq;
 
-	/* Skip the 8254 timer interrupt (IRQ 0) if requested.  */
-	if (bus_irq == 0 && disable_irq0_through_ioapic)
-		return;
-
 	/*
 	 * Convert 'gsi' to 'ioapic.pin'.
 	 */
@@ -1052,10 +1046,6 @@ void __init mp_config_acpi_legacy_irqs(v
 	for (i = 0; i < 16; i++) {
 		int idx;
 
-		/* Skip the 8254 timer interrupt (IRQ 0) if requested.  */
-		if (i == 0 && disable_irq0_through_ioapic)
-			continue;
-
 		for (idx = 0; idx < mp_irq_entries; idx++) {
 			struct mp_config_intsrc *irq = mp_irqs + idx;
 
@@ -1413,17 +1403,6 @@ static int __init force_acpi_ht(const st
 }
 
 /*
- * Don't register any ...
From: Ingo Molnar
Date: Friday, July 11, 2008 - 9:47 pm

thanks Rafael, i have applied your fix to tip/x86/core and to 
auto-x86-next as well.

what happened is a case of too many fixes for the same problem :-/

	Ingo
--

From: Vegard Nossum
Date: Friday, July 11, 2008 - 7:54 am

Ouch, sorry. You need CONFIG_DEBUG_INFO for this to really make any
sense. My habits are deceiving me, because I always build with
DEBUG_INFO. Ingo claims that this slows down his builds, so he never
does it. In the end, it is up to each one of us to choose his options,
but in general, I think it should be used when testing kernels.

If you're tired of rebuilding kernels now, I don't blame you. Another
lesson learned for me :-(


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
From: Vegard Nossum
Date: Friday, July 11, 2008 - 8:00 am

BTW, did the new kernel fail in exactly the same place? If not, you
should also replace the EIP in the new crash report on the addr2line
command line, so in general: addr2line -e vmlinux -f -i <EIP here>.

(I don't think it really makes sense for the kernel to crash in
kmem_cache_alloc() this early in the boot process, so I'm guessing you
have a different EIP.)

(Also don't rebuild a bad kernel just to try this out again, but if
you happen to run across another bad one for example during bisection,
you can try it then.)


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
From: Török Edwin
Date: Friday, July 11, 2008 - 8:27 am

Yep, I am using ccache, same sources -> same binary.

addr2line -i now says:
/var/local/src/linux-2.6.git/linux-2.6/mm/slub.c:1648
/var/local/src/linux-2.6.git/linux-2.6/mm/slub.c:1662

Strangely the EIP is the same even after rebuilding with debug info.

Since tip/master supports the latency tracing features, I won't dig
further into the linux-next problem now
(we'll know tomorrow if the commits in tip solve the boot problem).

Best regards,
--Edwin

--

From: Thomas Meyer
Date: Wednesday, July 16, 2008 - 2:11 pm

I have got the same problem as Edwin TÖRÖK: From next-20080710 to next-20080711 the kernel fails to boot.
EIP seems to be in function kmem_cache_alloc.

This is also true for next-20080716.
I didn't try the kernels > next-20080711 and < next-20080716.

Any news on this bug?

My config is:
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.26
# Wed Jul 16 21:28:31 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
# CONFIG_HAVE_CPUMASK_OF_CPU_MAP is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_32_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General ...
From: Frédéric Weisbecker
Date: Wednesday, July 16, 2008 - 2:57 pm

Yes, a fix has been posted:
http://article.gmane.org/gmane.linux.kernel.kexec/1882

But I don't know when it will be applied....
--

From: Vegard Nossum
Date: Friday, July 11, 2008 - 6:36 am

On Fri, Jul 11, 2008 at 1:12 PM, Török Edwin <edwintorok@gmail.com> wrote:
> Hi,
>
> Today's linux-next tree (commit
> 93847083e4791567931bd17c039cc35881cdad29) fails to boot:
> [built with gcc-4.2.4-3]
>
> BUG: Int 14: CR2 b0049dea
>     EDI 00000082 ESI 00000000 EBP c059be88 ESP c059be5c
>     EBX f000ec62 EDX 0000000e ECX c0595480 EAX f000ec62
>     err 00000000 EIP c0181ca0  CS 00000060 flg 00010082
> Stack:   00000040 c06a2ba0 000080d0 c0595480 c0000f19c c000f180 c0581120
> c059bea8
>         c02bf19b 00000000 00000080 c059beb8 c0000f194 c000f180 0000000a
> c059beb8
>         c03a1059 00000000 00000000 c059bed8 c05c4c7c  0009efff 00000000
> c04f4df4

Hi,

One really simple way of getting some more info out of this is to take
the EIP value (here c0181ca0) and run it through addr2line:

    $ addr2line -e vmlinux -i c0181ca0

But you need to make sure that the bzImage/vmlinux you booted
corresponds to the vmlinux you are running addr2line against.

This will tell you the source line which produced the page fault and
will probably give a good clue as to what went wrong.

Thanks for reporting :-)


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Frédéric Weisbecker
Date: Monday, July 14, 2008 - 7:11 pm

From: Takashi Iwai
Date: Tuesday, July 15, 2008 - 4:06 am

At Tue, 15 Jul 2008 04:11:26 +0200,

Confirmed that this fixes the boot problem on my machine, too.
(It explains why this happens only on x86-32...)

Added Bernhard to Cc.  Maybe we should defer firmware_map_add*()?


thanks,

--

From: Bernhard Walle
Date: Tuesday, July 15, 2008 - 4:15 am

Yes. I already posted a patch for that.

http://article.gmane.org/gmane.linux.kernel.kexec/1882



Bernhard
-- 
Bernhard Walle, SUSE LINUX Products GmbH, Architecture Development
--

From: Bernhard Walle
Date: Tuesday, July 15, 2008 - 4:17 am

That patch does not defer firmware_map_add*() but the kobject
initialisation, which also fixes the problem.


Bernhard
-- 
Bernhard Walle, SUSE LINUX Products GmbH, Architecture Development
--

From: Takashi Iwai
Date: Tuesday, July 15, 2008 - 4:53 am

At Tue, 15 Jul 2008 13:17:44 +0200,

Thanks, it's good to know that the fix is already pending.


Takashi
--

From: Bernhard Walle
Date: Tuesday, July 15, 2008 - 5:02 am

What's the process to get that into linux-next? Via Ingo's tip tree?



Bernhard
-- 
Bernhard Walle, SUSE LINUX Products GmbH, Architecture Development
--