Hi LKML,
I am installing gentoo 2007.0 (kernel 2.6.19) on a dual AMD Opteron server (total of 4 cores). The hard disk is a Stardom 2611-2S-S1 device: actually two 250GB drives in a RAID0 config managed by the device itself - it should appear to the kernel as one SATA drive. If it matters, the underlying HDs are "Seagate Barracuda 7200 10"s. Here's the device:
http://www.synetic.net/Synetic-Products/Stardoms/SR-2611-SA/Stardom-2611...
During the install and at different points in the process I get an "HSM violation" and the system becomes unresponsive. It looks like a similar situation to:
http://lkml.org/lkml/2007/6/6/195
Will more recent kernels work with this hardware (should I keep it and try the install again) or should I switch hardware to something more compatible (like an Adaptec card)?
Thanks!
Bryan
--
console output:
tag 0 cmd 0x39 Emask 0x2 stat 0x58 err 0x0 (HSM violation)
exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
--
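For readers unfamiliar with the "stat 0x58" field in the console output, the following is a minimal sketch that decodes an ATA status byte into its conventional bit names. The helper decode_ata_status() and the ST_* constants are hypothetical names made up for this illustration; only the bit positions follow the standard ATA status register layout.

```c
#include <stddef.h>
#include <string.h>

/* ATA status register bits (standard layout; ST_* names are local to this sketch). */
enum {
	ST_BSY  = 0x80,	/* device busy */
	ST_DRDY = 0x40,	/* device ready */
	ST_DF   = 0x20,	/* device fault */
	ST_DSC  = 0x10,	/* seek complete */
	ST_DRQ  = 0x08,	/* data request: device wants a data transfer */
	ST_CORR = 0x04,	/* corrected data (obsolete) */
	ST_IDX  = 0x02,	/* index (obsolete) */
	ST_ERR  = 0x01,	/* error */
};

/* Hypothetical helper: write the names of all set bits into buf, space separated. */
static const char *decode_ata_status(unsigned char stat, char *buf, size_t len)
{
	static const struct { unsigned char bit; const char *name; } bits[] = {
		{ ST_BSY, "BSY" }, { ST_DRDY, "DRDY" }, { ST_DF, "DF" },
		{ ST_DSC, "DSC" }, { ST_DRQ, "DRQ" }, { ST_CORR, "CORR" },
		{ ST_IDX, "IDX" }, { ST_ERR, "ERR" },
	};
	size_t i;

	if (len == 0)
		return buf;
	buf[0] = '\0';
	for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
		if (!(stat & bits[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, " ", len - strlen(buf) - 1);
		strncat(buf, bits[i].name, len - strlen(buf) - 1);
	}
	return buf;
}
```

Called with 0x58, the helper yields "DRDY DSC DRQ": the device is ready and not busy, but is still asserting DRQ. That matches the "stuck DRQ" condition discussed later in this thread.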
Output from hdparm -I /dev/sda:
/dev/sda:
ATA device, with non-removable media
Model Number: STARDOM V.36.A0B
Serial Number:
Firmware Revision: V.36.A0B
Standards:
Used: ATA/ATAPI-6 T13 1410D revision 0
[snip]
Commands/features:
Enabled Supported:
* SMART feature set
* Power Management feature set
* Advanced Power Management feature set
* 48-bit Address feature set
* Mandatory FLUSH_CACHE
* SATA-I signaling speed (1.5 Gb/s)
* SATA-II signaling speed (3.0 Gb/s)
--
Parts of dmesg:
libata version 2.00 loaded
sata_nv 0000:00:05.0: version 2.0
ata1: SATA max UDMA/133 cmd 0xD480 ctl 0xD402 bmdma 0xCC00 irq 21
ata2: SATA max UDMA/133 cmd 0xD080 ctl 0xD002 bmdma 0xCC08 irq 21
scsi0 : sata_nv
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: ATA-6, max UDMA/133, 976794112 sectors: LBA48
ata1.00: ata1: dev 0 multi count 1
ata1.00: applying bridge limits
ata1.00: configured for UDMA/100
-
Hi,
[Adding linux-ide to CC]
Regards,
Michal
--
LOG
http://www.stardust.webpages.pl/log/
-
Please post full dmesg and full 'hdparm -I' result. Also, if possible,
please try 2.6.22.5. Even if it doesn't fix the problem, it would
report error conditions better.
--
tejun
-
The full dmesg and hdparm -I command output are attached.
I have received word from the vendor that the Stardom 2611 will do
RAID0 or 1 under Windows, but only RAID1 under Linux. (Their manual
said it worked with Linux but failed to mention the RAID mode
restriction: argh!)

They recommended the 2600 model for RAID0 with Linux, but that model
is only SATA-I so I will probably go with alternate hardware.

The vendor also suggested the possibility of a firmware upgrade to
the 2611 - I am still waiting to hear. I will post a followup if
this happens.

Thanks all for your help and suggestions!
Regards,
Bryan
-
If possible, please post dmesg from 2.6.22.5.
Thanks.
--
tejun
-
Presumably in the week and a half between Bryan's report and your request,
Bryan has gone off and got an adaptec card. Bryan, it would be helpful if
you could rebuild the original system and help us get to the bottom of this
bug, thanks.
-
I reported a very similar bug back a few releases ago.
Anyone who wants to try it themselves can do this with hdparm-7.7 (from sourceforge):
hdparm --drq-hsm-error /dev/sda
Whether or not it hangs the machine does depend upon exactly which SATA LLD is used,
and what model/revision of drive is installed. But if it hangs for you (e.g. Tejun),
then you now have a way to reproduce a HSM error "on demand" for testing. :)
Cheers
-
Hello,
Neat. Is this the FIFO-draining issue?
Thanks.
--
tejun
-
Yeah, that's the one. And I still patch my own kernels to
automatically drain up to 512 words from the FIFO when this happens.
Works like a charm. Patch below for demonstration purposes.
Signed-Off-By: Mark Lord <mlord@pobox.com>
---
--- linux/drivers/ata/libata-sff.c.orig 2007-04-26 12:02:46.000000000 -0400
+++ linux/drivers/ata/libata-sff.c 2007-04-29 08:29:27.000000000 -0400
@@ -413,6 +413,24 @@
 	ap->ops->irq_on(ap);
 }
 
+static void ata_drain_fifo (struct ata_port *ap, struct ata_queued_cmd *qc)
+{
+	u8 stat = ata_chk_status(ap);
+	/*
+	 * Try to clear stuck DRQ if necessary.
+	 */
+	if ((stat & ATA_DRQ) && (!qc || qc->dma_dir != DMA_TO_DEVICE)) {
+		unsigned int i, limit = 512;
+		printk("Draining up to %u words from data FIFO.\n", limit);
+		for (i = 0; i < limit ; ++i) {
+			ioread16(ap->ioaddr.data_addr);
+			if (!(ata_chk_status(ap) & ATA_DRQ))
+				break;
+		}
+		printk("Drained %u/%u words.\n", i, limit);
+	}
+}
+
 /**
  * ata_bmdma_drive_eh - Perform EH with given methods for BMDMA controller
  * @ap: port to handle error for
@@ -469,7 +487,7 @@
 	}
 
 	ata_altstatus(ap);
-	ata_chk_status(ap);
+	ata_drain_fifo(ap, qc);
 	ap->ops->irq_clear(ap);
 
 	spin_unlock_irqrestore(ap->lock, flags);
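
The drain logic in the patch above can be tried out in user space with a simulated device. This is only a sketch under stated assumptions: struct sim_port, sim_status(), sim_read_data() and sim_drain_fifo() are hypothetical stand-ins invented here, not libata interfaces, and the "device" is modelled as nothing more than a count of words still waiting in its FIFO.

```c
#define SIM_DRQ 0x08	/* DRQ bit of the ATA status register */

/* Hypothetical stand-in for a port: number of 16-bit words still queued. */
struct sim_port {
	unsigned int words_pending;
};

/* Simulated status read: DRQ stays set while the FIFO holds data. */
static unsigned char sim_status(const struct sim_port *p)
{
	return p->words_pending ? (0x50 | SIM_DRQ) : 0x50;
}

/* Simulated 16-bit data register read: consumes one word, discards it. */
static unsigned short sim_read_data(struct sim_port *p)
{
	if (p->words_pending)
		p->words_pending--;
	return 0;
}

/*
 * Read and discard up to 'limit' words while DRQ is asserted.
 * 'is_write' mirrors the patch's dma_dir == DMA_TO_DEVICE check:
 * nothing is drained for a write command.
 * Returns the number of reads performed. (The loop in the patch breaks
 * before incrementing its counter, so it would report one fewer word
 * whenever DRQ clears before the limit is reached.)
 */
static unsigned int sim_drain_fifo(struct sim_port *p, int is_write,
				   unsigned int limit)
{
	unsigned int n = 0;

	if (is_write || !(sim_status(p) & SIM_DRQ))
		return 0;
	while (n < limit) {
		(void)sim_read_data(p);
		n++;
		if (!(sim_status(p) & SIM_DRQ))
			break;
	}
	return n;
}
```

With 3 words pending and a limit of 512, the sketch performs 3 reads and DRQ clears. With 1000 words pending it stops at the 512-word cap and leaves 488 behind, which is the bounded behaviour the patch relies on to avoid spinning forever.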
-
I think there have been enough cases where this draining was necessary.
IIRC, ata_piix was involved in those cases, right? If so, can you
please submit a patch which applies this only to affected controllers?
I don't feel too confident about applying this to all SFF controllers.
Thanks.
--
tejun
-
Old IDE does it on all controllers bar a couple. So we have a very good
knowledge of what does/doesn't work. The one that needs care in old ide
is an ordering issue where a state machine reset done first causes the
drain of the I/O to hang.
-
Hmmm... So, do we apply draining to all PATA? Or is ata_piix SATA
affected too?
--
tejun
-
ata_piix SATA is definitely affected when a PATA_drive to SATA_host bridge is present.
Possibly other times.
Cheers
-
I would think all SFF controllers, since a lot of first gen SATA are
really bridged solutions. If they are flagging DRQ, I say oblige them :)
Jeff
-
Alright, then the posted patch should be good enough. Mark, can you be
bothered to regenerate the patch and post it one more time (again)? It
seems we all agree the update is needed.
Thanks a lot.
--
tejun
-
I think this original patch still applies cleanly on at least 2.6.23-rc7.
Drain up to 512 words from host/bridge FIFO on stuck DRQ HSM violation,
rather than just getting stuck there forever.

Signed-Off-By: Mark Lord <mlord@pobox.com>
---
--- old/drivers/ata/libata-sff.c 2007-04-26 12:02:46.000000000 -0400
+++ linux/drivers/ata/libata-sff.c 2007-04-29 08:29:27.000000000 -0400
@@ -413,6 +413,24 @@
 	ap->ops->irq_on(ap);
 }
 
+static void ata_drain_fifo (struct ata_port *ap, struct ata_queued_cmd *qc)
+{
+	u8 stat = ata_chk_status(ap);
+	/*
+	 * Try to clear stuck DRQ if necessary.
+	 */
+	if ((stat & ATA_DRQ) && (!qc || qc->dma_dir != DMA_TO_DEVICE)) {
+		unsigned int i, limit = 512;
+		printk("Draining up to %u words from data FIFO.\n", limit);
+		for (i = 0; i < limit ; ++i) {
+			ioread16(ap->ioaddr.data_addr);
+			if (!(ata_chk_status(ap) & ATA_DRQ))
+				break;
+		}
+		printk("Drained %u/%u words.\n", i, limit);
+	}
+}
+
 /**
  * ata_bmdma_drive_eh - Perform EH with given methods for BMDMA controller
  * @ap: port to handle error for
@@ -469,7 +487,7 @@
 	}
 
 	ata_altstatus(ap);
-	ata_chk_status(ap);
+	ata_drain_fifo(ap, qc);
 	ap->ops->irq_clear(ap);
 
 	spin_unlock_irqrestore(ap->lock, flags);
-
ap->ops->cleanup();
might be wiser
-
Though I have queued Mark's patch to be applied, my gut feeling would
If someone needs that, they can override the error handler with their
own. No need for a new op.
Jeff
-
PDC202xx needs
-
Alan, you're the expert there (my condolences!).
Can you generate a fix for the PDC202xx to go with this?
Cheers
-
Actually, I believe we should base it on qc->sect_size instead.
Then, if somebody also would like to submit a patch introducing
a cleanup() method, then please do so!
As a separate patch, though (seems to be the "libata way").
* * * *
I think this original patch still applies cleanly on at least 2.6.23-rc7.
Drain up to 512 words from host/bridge FIFO on stuck DRQ HSM violation,
rather than just getting stuck there forever.

Signed-off-by: Mark Lord <mlord@pobox.com>
---
--- old/drivers/ata/libata-sff.c 2007-09-28 09:29:22.000000000 -0400
+++ linux/drivers/ata/libata-sff.c 2007-09-28 09:39:44.000000000 -0400
@@ -420,6 +420,28 @@
 	ap->ops->irq_on(ap);
 }
 
+static void ata_drain_fifo(struct ata_port *ap, struct ata_queued_cmd *qc)
+{
+	u8 stat = ata_chk_status(ap);
+	/*
+	 * Try to clear stuck DRQ if necessary,
+	 * by reading/discarding up to two sectors worth of data.
+	 */
+	if ((stat & ATA_DRQ) && (!qc || qc->dma_dir != DMA_TO_DEVICE)) {
+		unsigned int i;
+		unsigned int limit = qc ? qc->sect_size : ATA_SECT_SIZE;
+
+		printk(KERN_WARNING "Draining up to %u words from data FIFO.\n",
+			limit);
+		for (i = 0; i < limit ; ++i) {
+			ioread16(ap->ioaddr.data_addr);
+			if (!(ata_chk_status(ap) & ATA_DRQ))
+				break;
+		}
+		printk(KERN_WARNING "Drained %u/%u words.\n", i, limit);
+	}
+}
+
 /**
  * ata_bmdma_drive_eh - Perform EH with given methods for BMDMA controller
  * @ap: port to handle error for
@@ -476,7 +498,7 @@
 	}
 
 	ata_altstatus(ap);
-	ata_chk_status(ap);
+	ata_drain_fifo(ap, qc);
 	ap->ops->irq_clear(ap);
 
 	spin_unlock_irqrestore(ap->lock, flags);
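
The revised patch takes its limit from qc->sect_size, which is a size in bytes, while the loop counts 16-bit words. The sketch below works through that arithmetic. It assumes a hypothetical struct sim_cmd in place of struct ata_queued_cmd and a local SECT_SIZE constant mirroring the 512-byte ATA_SECT_SIZE; neither is taken from the kernel headers.

```c
#include <stddef.h>

#define SECT_SIZE 512u	/* mirrors ATA_SECT_SIZE: bytes in a standard sector */

/* Hypothetical stand-in for the one field of the queued command used here. */
struct sim_cmd {
	unsigned int sect_size;	/* sector size of the command, in bytes */
};

/* Limit as chosen in the patch: a byte count reused as a word count. */
static unsigned int drain_limit_words(const struct sim_cmd *qc)
{
	return qc ? qc->sect_size : SECT_SIZE;
}

/* Each drained word is 16 bits, so the byte cap is twice the word cap. */
static unsigned int drain_limit_bytes(const struct sim_cmd *qc)
{
	return drain_limit_words(qc) * 2u;
}
```

With no active command the cap is 512 words, i.e. 1024 bytes, which is where the "two sectors worth of data" wording in the patch comment comes from. A command with a 2048-byte sector size would raise the cap to 2048 words (4096 bytes) by the same rule.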
-
applied, after hand-editing out the top of the message, so that it would
not be copied into the kernel changelog
-
Acked-by: Tejun Heo <htejun@gmail.com>
--
tejun
-
Nacked-by: scripts/checkpatch.pl
-
> Nacked-by: scripts/checkpatch.pl
Mark, it seems you'll have to get ACK from this dude first. :-)
--
tejun
-
Hey, we just found something which doesn't crash my Vaio!
sony:/home/akpm/hdparm-7.7> 0 ./hdparm --drq-hsm-error /dev/sda
/dev/sda:
triggering "stuck DRQ" host state machine error
do_drq_hsm_error: Success
ata status=0x58 ata error=0x00
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/40 tag 0 cdb 0x0 data 0
res 58/00:01:00:00:00/00:00:00:00:00/40 Emask 0x2 (HSM violation)
ata3: soft resetting port
ata3.00: configured for UDMA/100
ata3: EH complete
sd 2:0:0:0: [sda] 195371568 512-byte hardware sectors (100030 MB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
How dull. (ata_piix)
-
On my two very similar notebooks, it crashes libata when a PATA drive is used
behind a Marvell converter chip, but not when a SATA drive is used directly.
Cheers
-
