Re: sata_sil24 broken since 2.6.23-rc4-mm1

!MAILaRCHIVE_VOTE_RePLACE
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
To: Tejun Heo <htejun@...>
Cc: Jeff Garzik <jeff@...>, <linux-kernel@...>, <akpm@...>
Date: Sunday, September 30, 2007 - 2:00 am

On 9/28/07, Torsten Kaiser <just.for.lkml@googlemail.com> wrote:

That's still the case.


That is no longer the case. Yesterday this kernel did also show the failure.
The error looked a little bit different, but happend at the same
location during the bootup.
Sadly dmesg overflowed and I was not able to capture the first error.

[   53.462632] Freeing unused kernel memory: 332k freed
ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
[   77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
[   77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[   77.170909]          res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
0x40 (internal error)
[   77.504292] ata1: soft resetting port
[   77.544221] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[   77.642567] ata1.00: configured for UDMA/100
[   77.642572] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE,SUGGEST_OK
[   77.642574] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current]
[descriptor]
[   77.642578] Descriptor sense data with sense descriptors (in hex):
[   77.642580]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[   77.642584]         00 00 00 af
[   77.642586] sd 0:0:0:0: [sda] Add. Sense: No additional sense information
[   77.642589] end_request: I/O error, dev sda, sector 625137161
[   77.642593] ata1: EH complete
[   77.642609] sd 0:0:0:0: [sda] 625142448 512-byte hardware sectors (320073 MB)
[   77.642616] sd 0:0:0:0: [sda] Write Protect is off
[   77.642618] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[   77.642628] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[   77.666927] md: super_written gets error=-5, uptodate=0
[   77.666930] raid5: Disk failure on sda2, disabling device.
Operation continuing on 1 devices

That "Operation continuing on 1 devices" is a 'little' bit misleading.
A RAID5 with two failed devices will not continue to operate. :(
The same error is repeated several times, so I expect the first error
also looked like that.

Other things I have done to narrow it down:
Comparing 2.6.23-rc3-mm1 and 2.6.23-rc4-mm1 I found the following hunk
from libata-core.c:
@@ -2182,6 +2183,17 @@
        if (ap->ops->cable_detect)
                ap->cbl = ap->ops->cable_detect(ap);

+       /* We may have SATA bridge glue hiding here irrespective of the
+          reported cable types and sensed types */
+       ata_link_for_each_dev(dev, &ap->link) {
+               if (!ata_dev_enabled(dev))
+                       continue;
+               /* SATA drives indicate we have a bridge. We don't know which
+                  end of the link the bridge is which is a problem */
+               if (ata_id_is_sata(dev->id))
+                       ap->cbl = ATA_CBL_SATA;
+       }
+

But removing this from the current -mm does still show the error, so
thats not the source of the bug. (That test kicked the first drive
(sdb) from the RAID5 with the same error as most times)

Rebooting into the "safe" rc7-sglist-kernel I got the above error that
killed the RAID5.
I rebooted into a system (kernel 2.6.21-rc5-mm2, please not the
2.6.*21*, that is only a rescue system and not updated often) on a
separate partition to rebuild it. When I tried to readd the failed sdb
the system locked up with this error:

Sep 29 22:25:37 treogen [  205.407893] ata2.00: exception Emask 0x0
SAct 0x0 SErr 0x0 action 0x2 frozen
Sep 29 22:25:37 treogen [  205.407900] ata2.00: cmd
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
Sep 29 22:25:37 treogen [  205.407901]          res
40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
Sep 29 22:25:38 treogen [  205.720531] ata2: soft resetting port
Sep 29 22:25:38 treogen [  205.720554] ata2: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:25:38 treogen [  205.730560] ATA: abnormal status 0xEA on
port 0xffffc20001432007
Sep 29 22:25:38 treogen [  205.740560] ATA: abnormal status 0xEA on
port 0xffffc20001432007
Sep 29 22:26:07 treogen [  235.343001] ata1.00: exception Emask 0x0
SAct 0x1 SErr 0x0 action 0x2 frozen
Sep 29 22:26:07 treogen [  235.343007] ata1.00: cmd
61/08:00:bf:52:2a/00:00:01:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 29 22:26:07 treogen [  235.343009]          res
40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
Sep 29 22:26:08 treogen [  235.655644] ata1: soft resetting port
Sep 29 22:26:08 treogen [  235.655666] ata1: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:08 treogen [  235.675602] ata2.00: qc timeout (cmd 0xec)
Sep 29 22:26:08 treogen [  235.675608] ata2.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:26:08 treogen [  235.675610] ata2.00: revalidation failed (errno=-5)
Sep 29 22:26:08 treogen [  235.675612] ata2: failed to recover some
devices, retrying in 5 secs
Sep 29 22:26:13 treogen [  240.668113] ata2: hard resetting port
Sep 29 22:26:15 treogen [  242.833438] ata2: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:15 treogen [  242.843444] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:26:15 treogen [  242.853443] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:26:38 treogen [  265.590759] ata1.00: qc timeout (cmd 0xec)
Sep 29 22:26:38 treogen [  265.590765] ata1.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:26:38 treogen [  265.590767] ata1.00: revalidation failed (errno=-5)
Sep 29 22:26:38 treogen [  265.590769] ata1: failed to recover some
devices, retrying in 5 secs
Sep 29 22:26:43 treogen [  270.583271] ata1: hard resetting port
Sep 29 22:26:45 treogen [  272.748596] ata1: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:45 treogen [  272.758601] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:26:45 treogen [  272.768602] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:26:45 treogen [  272.788490] ata2.00: qc timeout (cmd 0xec)
Sep 29 22:26:45 treogen [  272.788494] ata2.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:26:45 treogen [  272.788496] ata2.00: revalidation failed (errno=-5)
Sep 29 22:26:45 treogen [  272.788501] ata2.00: limiting speed to UDMA/100:PIO3
Sep 29 22:26:45 treogen [  272.788503] ata2: failed to recover some
devices, retrying in 5 secs
Sep 29 22:26:50 treogen [  277.781001] ata2: hard resetting port
Sep 29 22:26:52 treogen [  279.946326] ata2: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:52 treogen [  279.956331] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:26:52 treogen [  279.966330] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:27:15 treogen [  302.703652] ata1.00: qc timeout (cmd 0xec)
Sep 29 22:27:15 treogen [  302.703661] ata1.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:27:15 treogen [  302.703663] ata1.00: revalidation failed (errno=-5)
Sep 29 22:27:15 treogen [  302.703669] ata1.00: limiting speed to UDMA/100:PIO3
Sep 29 22:27:15 treogen [  302.703671] ata1: failed to recover some
devices, retrying in 5 secs
Sep 29 22:27:20 treogen [  307.696161] ata1: hard resetting port
Sep 29 22:27:22 treogen [  309.884773] ata1: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:27:22 treogen [  309.894784] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:27:22 treogen [  309.904786] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:27:22 treogen [  309.904940] ata2.00: qc timeout (cmd 0xec)
Sep 29 22:27:22 treogen [  309.904946] ata2.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:27:22 treogen [  309.904948] ata2.00: revalidation failed (errno=-5)
Sep 29 22:27:22 treogen [  309.904951] ata2.00: disabled
Sep 29 22:27:23 treogen [  310.406968] ata2: EH complete
Sep 29 22:27:23 treogen [  310.407013] sd 1:0:0:0: [sdb] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Sep 29 22:27:23 treogen [  310.407018] end_request: I/O error, dev
sdb, sector 19550911

Is it possible that the bug is much older, and only some change in
rc3-mm->rc4-mm makes it visible that the initialization of the SiI3132
is incomplete?

Currently I try a 2.6.23-rc4-mm1 with only 629 of 1628 mm-patches applied...

Torsten
-
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Wed Sep 26, 4:26 pm)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Tejun Heo, (Thu Sep 27, 12:54 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Tejun Heo, (Thu Sep 27, 12:57 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Thu Sep 27, 2:14 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Jeff Garzik, (Thu Sep 27, 2:24 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Thu Sep 27, 1:34 pm)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Tejun Heo, (Thu Sep 27, 4:22 pm)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Fri Sep 28, 1:36 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Sun Sep 30, 2:00 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Tejun Heo, (Sun Sep 30, 10:34 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Sun Sep 30, 12:19 pm)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Tejun Heo, (Sun Sep 30, 1:39 pm)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Sun Sep 30, 2:39 pm)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Mon Oct 1, 2:00 pm)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Wed Oct 3, 11:21 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Wed Oct 3, 11:55 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Matt Mackall, (Wed Oct 3, 12:38 pm)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Thu Oct 4, 1:32 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Matt Mackall, (Thu Oct 4, 1:05 pm)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Fri Oct 5, 2:06 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Sun Oct 7, 4:44 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Sun Oct 7, 10:39 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Tejun Heo, (Wed Oct 10, 11:25 pm)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Jens Axboe, (Thu Oct 11, 4:26 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Tejun Heo, (Thu Oct 11, 4:36 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Jens Axboe, (Thu Oct 11, 6:28 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Thu Oct 11, 1:54 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Tejun Heo, (Thu Oct 11, 2:26 am)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Thu Oct 11, 1:51 pm)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Wed Oct 3, 1:36 pm)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Matt Mackall, (Wed Oct 3, 1:51 pm)
Re: sata_sil24 broken since 2.6.23-rc4-mm1, Torsten Kaiser, (Wed Oct 3, 2:06 pm)