Re: 2.6.26-git0: IDE oops during boot

From: Pavel Machek
Date: Wednesday, February 6, 2008 - 4:08 am

Disabling CONFIG_IDE made my machine boot, as it was using libata
anyway.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Bartlomiej Zolnierkiewicz
Date: Wednesday, February 6, 2008 - 1:05 pm

Hi,



this comes from ide-generic

Kamalesh/Pavel:

Could you try latest git and see if the OOPS is still there?

[ Yeah, I'm unable to reproduce it. :( ]

Thanks,
Bart
--

From: Kamalesh Babulal
Date: Thursday, February 7, 2008 - 2:35 am

Hi Bart,

The panic is reproducible with the 2.6.24-git16 kernel; the call trace is
similar to the previous one:

BUG: unable to handle kernel paging request at ffffffffffffffa0
IP: [<ffffffff80415673>] init_irq+0x188/0x444
PGD 203067 PUD 204067 PMD 0 
Oops: 0000 [1] SMP 
CPU 3 
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.24-git16 #1
RIP: 0010:[<ffffffff80415673>]  [<ffffffff80415673>] init_irq+0x188/0x444
RSP: 0000:ffff81022f093e00  EFLAGS: 00010282
RAX: ffffffffffffff80 RBX: ffffffff808ad200 RCX: 0000000000000000
RDX: 00000000ffffffff RSI: ffff81022fc039c0 RDI: ffffffff807512c0
RBP: ffff81022f093e30 R08: ffff81022f093d70 R09: 0000000000000002
R10: 0000000000000001 R11: ffff81022f093c00 R12: ffffffff808b4500
R13: ffffffff808b4510 R14: 0000000000000000 R15: ffffffffffffffff
FS:  0000000000000000(0000) GS:ffff81022f0e7ac0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffffffffffffffa0 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff81022f092000, task ffff81022f0797e0)
Stack:  ffff81022f093e30 0000000000000000 ffffffff808ad200 ffffffff808ad220
 ffffffff808add80 0000000000000000 ffff81022f093eb0 ffffffff8041648f
 ffff81022f093ec0 0000000000000000 0000000080751ee0 0000000000000246
Call Trace:
 [<ffffffff8041648f>] ide_device_add_all+0xb60/0xe54
 [<ffffffff807d6d48>] ide_generic_init+0x46/0x4a
 [<ffffffff807b873b>] kernel_init+0x175/0x2e7
 [<ffffffff8020bff8>] child_rip+0xa/0x12
 [<ffffffff8037476c>] acpi_ds_init_one_object+0x0/0x88
 [<ffffffff807b85c6>] kernel_init+0x0/0x2e7
 [<ffffffff8020bfee>] child_rip+0x0/0x12


Code: 89 03 49 8b 45 18 48 89 18 48 39 1b 75 04 0f 0b eb fe fe 05 20 71 38 00 fb eb 5b 48 8b 83 20 07 00 00 83 ca ff 48 83 c0 80 74 0e <48> 8b 40 20 48 8b 80 88 00 00 00 8b 50 04 48 8b 3d 48 11 30 00 
RIP  [<ffffffff80415673>] init_irq+0x188/0x444
 ...
From: Bartlomiej Zolnierkiewicz
Date: Thursday, February 7, 2008 - 7:01 am

Thanks, I again reviewed ide-probe.c changes but nothing seems wrong...


Please also try disassembling init_irq using gdb so we see where it fails.

--

From: Nish Aravamudan
Date: Sunday, February 10, 2008 - 2:32 pm

Kamalesh, were you able to bisect this down? I just got hit by the
same panic on a 4-way x86_64, with 2.6.24-git22.

Thanks,
Nish
--

From: Kamalesh Babulal
Date: Monday, February 11, 2008 - 12:54 am

Hi Nish,

I tried bisecting and the guilty patch seems to be 

36501650ec45b1db308c3b51886044863be2d762 is first bad commit
commit 36501650ec45b1db308c3b51886044863be2d762
Author: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Date:   Fri Feb 1 23:09:31 2008 +0100

    ide: keep pointer to struct device instead of struct pci_dev in ide_hwif_t


The gdb output also points to the changes made by the guilty patch:

(gdb) p ide_device_add_all
$1 = {int (u8 *, const struct ide_port_info *)} 0xffffffff804176ac <ide_device_add_all>
(gdb) p/x 0xffffffff804176ac+0xb60
$2 = 0xffffffff8041820c
(gdb) l *0xffffffff8041820c
0xffffffff8041820c is in ide_device_add_all (drivers/ide/ide-probe.c:1249).
1244                    goto out;
1245            }
1246
1247            sg_init_table(hwif->sg_table, hwif->sg_max_nents);
1248
1249            if (init_irq(hwif) == 0)
1250                    goto done;
1251
1252            old_irq = hwif->irq;
1253            /*
(gdb) 


(gdb) p init_irq
$1 = {int (ide_hwif_t *)} 0xffffffff8041721f <init_irq>
(gdb) p/x 0xffffffff8041721f+0x1a4
$2 = 0xffffffff804173c3
(gdb) l *0xffffffff804173c3
0xffffffff804173c3 is in init_irq (include/asm/pci.h:101).
96      /* Returns the node based on pci bus */
97      static inline int __pcibus_to_node(struct pci_bus *bus)
98      {
99              struct pci_sysdata *sd = bus->sysdata;
100
101             return sd->node;
102     }
103
104     static inline cpumask_t __pcibus_to_cpumask(struct pci_bus *bus)
105     {
(gdb) 
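
The symbol+offset arithmetic used in the gdb session above can be sketched in plain C. This is only an illustration: parse_trace_sym() and resolve_addr() are hypothetical helper names, and the base addresses are simply the ones gdb printed for this particular build, so they will differ for any other vmlinux.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parsed form of a call-trace entry such as "init_irq+0x188/0x444":
 * symbol name, offset into the function, and total function size. */
struct trace_sym {
	char name[64];
	unsigned long long offset;
	unsigned long long size;
};

/* Split "name+0xoff/0xsize" into its three parts.
 * Returns 0 on success, -1 if the text does not have that shape. */
static int parse_trace_sym(const char *text, struct trace_sym *out)
{
	if (sscanf(text, "%63[^+]+%llx/%llx",
		   out->name, &out->offset, &out->size) != 3)
		return -1;
	return 0;
}

/* Same computation as "p/x <symbol address>+<offset>" in gdb. */
static unsigned long long resolve_addr(unsigned long long base,
				       unsigned long long offset)
{
	return base + offset;
}
```

With the values from the session above, resolve_addr(0xffffffff804176ac, 0xb60) gives 0xffffffff8041820c, which is the address then passed to "l *" in gdb.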


-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
--

From: Bartlomiej Zolnierkiewicz
Date: Monday, February 11, 2008 - 12:35 pm

Hi,


Thanks for the detailed analysis and sorry for the bug.

I think that this may have just been fixed by Andi's recent hwif_to_node()
fix (patch below, it is in Linus' tree already). Could you please verify this?

commit 1f07e988290fc45932f5028c9e2a862c37a57336
Author: Andi Kleen <andi@firstfloor.org>
Date:   Mon Feb 11 01:35:20 2008 +0100

    Prevent IDE boot ops on NUMA system
    
    Without this patch a Opteron test system here oopses at boot with
    current git.
    
    Calling to_pci_dev() on a NULL pointer gives a negative value so the
    following NULL pointer check never triggers and then an illegal address
    is referenced.  Check the unadjusted original device pointer for NULL
    instead.
    
    Signed-off-by: Andi Kleen <ak@suse.de>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/include/linux/ide.h b/include/linux/ide.h
index 23fad89..a3b69c1 100644
--- a/include/linux/ide.h
+++ b/include/linux/ide.h
@@ -1295,7 +1295,7 @@ static inline void ide_dump_identify(u8 *id)
 static inline int hwif_to_node(ide_hwif_t *hwif)
 {
 	struct pci_dev *dev = to_pci_dev(hwif->dev);
-	return dev ? pcibus_to_node(dev->bus) : -1;
+	return hwif->dev ? pcibus_to_node(dev->bus) : -1;
 }
 
 static inline ide_drive_t *ide_get_paired_drive(ide_drive_t *drive)
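
A minimal userspace sketch of why the old NULL check never triggers, assuming to_pci_dev() is a container_of()-style macro that subtracts the offset of the embedded struct device. The mock_* types and helper names below are hypothetical, and the offsets (bus at 0x20, embedded device at 0x80) are assumptions read off the oops earlier in the thread (RAX: ffffffffffffff80, CR2: ffffffffffffffa0), not taken from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct device and struct pci_dev.
 * Only the two member offsets matter for this illustration. */
struct mock_device {
	char opaque[8];
};

struct mock_pci_dev {
	char pad0[0x20];
	void *bus;				/* assumed offset 0x20 */
	char pad1[0x80 - 0x20 - sizeof(void *)];
	struct mock_device dev;			/* assumed offset 0x80 */
};

_Static_assert(offsetof(struct mock_pci_dev, bus) == 0x20, "bus offset");
_Static_assert(offsetof(struct mock_pci_dev, dev) == 0x80, "dev offset");

/* What a container_of()-style conversion boils down to: subtract the
 * member offset.  Done on integers so that the example itself performs
 * no undefined pointer arithmetic on NULL. */
static uint64_t mock_to_pci_dev(uint64_t dev_addr)
{
	return dev_addr - offsetof(struct mock_pci_dev, dev);
}

/* Address that would be loaded from when reading pdev->bus. */
static uint64_t mock_bus_load_addr(uint64_t pdev_addr)
{
	return pdev_addr + offsetof(struct mock_pci_dev, bus);
}

/* Pre-fix check: tests the adjusted pointer, which is nonzero for NULL. */
static int old_check_sees_null(uint64_t dev_addr)
{
	return mock_to_pci_dev(dev_addr) == 0;
}

/* Post-fix check: tests the original, unadjusted pointer. */
static int new_check_sees_null(uint64_t dev_addr)
{
	return dev_addr == 0;
}
```

Under these assumed offsets, a NULL device pointer converts to 0xffffffffffffff80 and the subsequent read of the bus member lands at 0xffffffffffffffa0, matching the RAX and CR2 values in the oops; old_check_sees_null(0) is false while new_check_sees_null(0) is true.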
--

From: Kamalesh Babulal
Date: Tuesday, February 12, 2008 - 2:04 am

Hi Bart,
Thanks! The patch solves the kernel panic, but after applying it the kernel is not
able to mount the root filesystem and panics. I am not sure what is causing this panic.

Creating root device.
Mounting root filesystem.
mount: could not  find filesystem
Kernel panic - not syncing: Attempted to kill init!


-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
--

From: Bartlomiej Zolnierkiewicz
Date: Wednesday, February 13, 2008 - 4:00 pm

Hi,


Is

- the commit 36501650ec45b1db308c3b51886044863be2d762 with Andi's fix applied

or

- the commit f6fb786d6dcdd7d730e4fba620b071796f487e1b
  (the one before commit 36501650ec45b1db308c3b51886044863be2d762)


Is IDE actually used for the boot device?

[ Please send a dmesg output from the working system. ]

Thanks,
Bart
--

From: Kamalesh Babulal
Date: Thursday, February 14, 2008 - 2:46 am

No, the commit before commit 36501650ec45b1db308c3b51886044863be2d762 did not work either, i


-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
--

From: Yinghai Lu
Date: Thursday, February 14, 2008 - 3:28 am

On Thu, Feb 14, 2008 at 1:46 AM, Kamalesh Babulal

It seems you have an enclosure connected.

Please check whether SES is enabled in your .config.

If so, please try

http://lkml.org/lkml/2008/2/13/673

YH
--

From: Kamalesh Babulal
Date: Friday, February 15, 2008 - 4:15 am

Hi,

Thanks for pointing out the patch. I do not have the SES config option enabled,
but I tried your patch anyway and it does not solve the panic. The kernel
panics with the same message as before. I have attached the .config
file which I am using; please let me know if I am missing any option or
have set one incorrectly in the configuration.



-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
--

From: Yinghai Lu
Date: Monday, February 25, 2008 - 12:05 am

On Fri, Feb 15, 2008 at 3:15 AM, Kamalesh Babulal

can you try x86.git#testing?

http://people.redhat.com/mingo/x86.git/README

YH
--

From: Yinghai Lu
Date: Monday, February 25, 2008 - 12:23 am

and try attached patch.

YH
--

From: Bartlomiej Zolnierkiewicz
Date: Thursday, February 14, 2008 - 5:01 am

Hi,


Hmm, it is not (from dmesg):

Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
Probing IDE interface ide0...
hda: HL-DT-STCD-RW/DVD DRIVE GCC-4244N, ATAPI CD/DVD-ROM drive
Probing IDE interface ide1...
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: ATAPI 24X DVD-ROM CD-R/RW drive, 2048kB Cache
Uniform CD-ROM driver Revision: 3.20

[...]

Adaptec aacraid driver 1.1-5[2449]-ms
ACPI: PCI Interrupt 0000:01:02.0[A] -> GSI 25 (level, low) -> IRQ 25
AAC0: kernel 5.2-0[11835] Jan  9 2007
AAC0: monitor 5.2-0[11835]
AAC0: bios 5.2-0[11835]
AAC0: serial 1625D1
AAC0: 64bit support enabled.
AAC0: 64 Bit DAC enabled
scsi0 : ServeRAID
scsi 0:0:0:0: Direct-Access     IBM      x366             V1.0 PQ: 0 ANSI: 2
scsi 0:1:0:0: Direct-Access     IBM-ESXS ST973401SS       B519 PQ: 0 ANSI: 5
scsi 0:1:1:0: Direct-Access     IBM-ESXS ST973401SS       B519 PQ: 0 ANSI: 5
scsi 0:1:2:0: Direct-Access     IBM-ESXS ST973401SS       B519 PQ: 0 ANSI: 5
scsi 0:3:0:0: Enclosure         IBM      SAS SES-2 DEVICE 0.09 PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 429459456 512-byte hardware sectors (219883 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 06 00 10 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, supports DPO and FUA
sd 0:0:0:0: [sda] 429459456 512-byte hardware sectors (219883 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 06 00 10 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, supports DPO and FUA
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 >
sd 0:0:0:0: [sda] Attached SCSI removable disk
sd 0:0:0:0: Attached scsi generic sg0 type 0
scsi 0:1:0:0: Attached scsi generic sg1 type 0
scsi 0:1:1:0: Attached scsi generic sg2 type 0
scsi 0:1:2:0: Attached scsi generic sg3 type 0
scsi 0:3:0:0: Attached scsi generic sg4 type 13

[...]

kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data ...
From: Bartlomiej Zolnierkiewicz
Date: Thursday, February 14, 2008 - 5:07 am

Yinghai Lu noticed that it may be actually a SES problem:

http://lkml.org/lkml/2008/2/14/88

[ I overlooked the above mail, sorry ]
--

From: James Bottomley
Date: Thursday, February 14, 2008 - 8:47 am

Only if SES is enabled (CONFIG_SCSI_ENCLOSURE); is it? Is there
actually a dmesg of the failing system somewhere? I couldn't find it in
the (somewhat long) thread.

James




--
