Stardom SATA HSM violation

To: <linux-kernel@...>
Date: Friday, August 24, 2007 - 11:22 pm

Hi LKML

I am installing Gentoo 2007.0 (kernel 2.6.19) on a dual AMD Opteron server (total of 4 cores). The hard disk is a Stardom 2611-2S-S1 device: actually two 250GB drives in a RAID0 config managed by the device itself, so it should appear to the kernel as one SATA drive. If it matters, the underlying HDs are "Seagate Barracuda 7200.10"s. Here's the device:

http://www.synetic.net/Synetic-Products/Stardoms/SR-2611-SA/Stardom-2611...

During the install and at different points in the process I get an "HSM violation" and the system becomes unresponsive. It looks like a similar situation to:

http://lkml.org/lkml/2007/6/6/195

Will more recent kernels work with this hardware (should I keep it and try the install again) or should I switch hardware to something more compatible (like an Adaptec card)?

Thanks!
Bryan

--
console output:

tag 0 cmd 0x39 Emask 0x2 stat 0x58 err 0x0 (HSM violation)
exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
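
For readers unfamiliar with the "stat" byte in the first line, here is a minimal sketch of how it breaks down into ATA status register bits. The bit values are the standard ATA ones; the helper names decode_ata_status() and drq_pending() are made up for this illustration and are not libata functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* ATA status register bits (standard ATA/ATAPI values). */
enum {
    ST_ERR  = 0x01, /* error */
    ST_DRQ  = 0x08, /* data request: device wants a data transfer */
    ST_DSC  = 0x10, /* device seek complete */
    ST_DF   = 0x20, /* device fault */
    ST_DRDY = 0x40, /* device ready */
    ST_BSY  = 0x80  /* busy */
};

/* Hypothetical helper: write the names of the set bits into buf. */
static char *decode_ata_status(uint8_t stat, char *buf, size_t len)
{
    static const struct { uint8_t bit; const char *name; } bits[] = {
        { ST_BSY, "BSY" }, { ST_DRDY, "DRDY" }, { ST_DF, "DF" },
        { ST_DSC, "DSC" }, { ST_DRQ, "DRQ" },   { ST_ERR, "ERR" },
    };
    size_t used = 0;

    if (len == 0)
        return buf;
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
        if (!(stat & bits[i].bit))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? " " : "", bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* out of space: stop with what fits */
        used += (size_t)n;
    }
    return buf;
}

/* Hypothetical helper: DRQ asserted while the device is not busy. */
static int drq_pending(uint8_t stat)
{
    return (stat & ST_DRQ) && !(stat & ST_BSY);
}
```

For stat 0x58 this yields "DRDY DSC DRQ": the device is ready but still has DRQ asserted, which is the "stuck DRQ" condition discussed later in this thread.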

--
Output from hdparm -I /dev/sda:

/dev/sda:

ATA device, with non-removable media
Model Number: STARDOM V.36.A0B
Serial Number:
Firmware Revision: V.36.A0B
Standards:
Used: ATA/ATAPI-6 T13 1410D revision 0

[snip]

Commands/features:
Enabled Supported:
* SMART feature set
* Power Management feature set
* Advanced Power Management feature set
* 48-bit Address feature set
* Mandatory FLUSH_CACHE
* SATA-I signaling speed (1.5 Gb/s)
* SATA-II signaling speed (3.0 Gb/s)

--
Parts of dmesg:
libata version 2.00 loaded
sata_nv 0000:00:05.0: version 2.0
ata1: SATA max UDMA/133 cmd 0xD480 ctl 0xD402 bmdma 0xCC00 irq 21
ata2: SATA max UDMA/133 cmd 0xD080 ctl 0xD002 bmdma 0xCC08 irq 21
scsi0 : sata_nv
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: ATA-6, max UDMA/133, 976794112 sectors: LBA48
ata1.00: ata1: dev 0 multi count 1
ata1.00: applying bridge limits
ata1.00: configured for UDMA/100
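
A small sketch of how two numbers in this dmesg excerpt can be read, assuming the usual SATA SStatus layout (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8) and 512-byte sectors. The struct and function names are invented for the example.

```c
#include <stdint.h>

/* Hypothetical container for the three SStatus fields. */
struct sstatus {
    unsigned det; /* device detection: 3 = device present, PHY link up */
    unsigned spd; /* negotiated speed: 1 = Gen1 (1.5 Gbps) */
    unsigned ipm; /* interface power management: 1 = active */
};

/* Split an SStatus value (printed in hex by libata) into its fields. */
static struct sstatus decode_sstatus(uint32_t v)
{
    struct sstatus s = { v & 0xf, (v >> 4) & 0xf, (v >> 8) & 0xf };
    return s;
}

/* Capacity in bytes, assuming 512-byte sectors. */
static uint64_t sectors_to_bytes(uint64_t sectors)
{
    return sectors * 512u;
}
```

decode_sstatus(0x113) gives DET=3, SPD=1, IPM=1, matching "link up 1.5 Gbps". sectors_to_bytes(976794112) is 500118585344 bytes, about 500 GB, which is consistent with two 250GB drives presented as a single RAID0 volume.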

-

To: <bryan@...>
Cc: <linux-kernel@...>, IDE/ATA development list <linux-ide@...>
Date: Sunday, August 26, 2007 - 7:10 pm

Hi,

[Adding linux-ide to CC]

Regards,
Michal

--
LOG
http://www.stardust.webpages.pl/log/
-

To: Michal Piotrowski <michal.k.k.piotrowski@...>
Cc: <bryan@...>, <linux-kernel@...>, IDE/ATA development list <linux-ide@...>
Date: Monday, September 3, 2007 - 4:53 am

Please post full dmesg and full 'hdparm -I' result. Also, if possible,
please try 2.6.22.5. Even if it doesn't fix the problem, it would
report error conditions better.

--
tejun
-

To: <linux-kernel@...>, IDE/ATA development list <linux-ide@...>
Cc: Tejun Heo <htejun@...>, <Michal.k.k.Piotrowski@...>, Andrew Morton <akpm@...>
Date: Thursday, September 6, 2007 - 11:00 am

The full dmesg and hdparm -I command output are attached.

I have received word from the vendor that the Stardom 2611 will do
RAID0 or 1 under windows, but only RAID1 under Linux. (Their manual
said it worked with Linux but failed to mention the RAID mode
restriction: argh!)

They recommended the 2600 model for RAID0 with Linux, but that model
is only SATA-I so I will probably go with alternate hardware.

The vendor also suggested the possibility of a firmware upgrade to
the 2611 - I am still waiting to hear. I will post a followup if
this happens.

Thanks all for your help and suggestions!

Regards,
Bryan
-

To: <bryan@...>
Cc: <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, <Michal.k.k.Piotrowski@...>, Andrew Morton <akpm@...>
Date: Thursday, September 6, 2007 - 8:58 pm

If possible, please post dmesg from 2.6.22.5.

Thanks.

--
tejun
-

To: Tejun Heo <htejun@...>
Cc: <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Wednesday, September 5, 2007 - 12:53 pm

Presumably in the week and a half between Bryan's report and your request,
Bryan has gone off and got an adaptec card. Bryan, it would be helpful if
you could rebuild the original system and help us get to the bottom of this
bug, thanks.

-

To: Andrew Morton <akpm@...>
Cc: Tejun Heo <htejun@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Wednesday, September 5, 2007 - 1:23 pm

I reported a very similar bug back a few releases ago.
Anyone who wants to try it themselves, can do this with hdparm-7.7 (from sourceforge):

hdparm --drq-hsm-error /dev/sda

Whether or not it hangs the machine does depend upon exactly which SATA LLD is used,
and what model/revision of drive is installed. But if it hangs for you (eg. Tejun),
then you now have a way to reproduce a HSM error "on demand" for testing. :)

Cheers
-

To: Mark Lord <liml@...>
Cc: Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Thursday, September 6, 2007 - 8:58 pm

Hello,

Neat. Is this the FIFO-draining issue?

Thanks.

--
tejun
-

To: Tejun Heo <htejun@...>
Cc: Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Friday, September 7, 2007 - 9:40 am

Yeah, that's the one. And I still patch my own kernels to
automatically drain up to 512 words from the FIFO when this happens.

Works like a charm. Patch below for demonstration purposes.

Signed-Off-By: Mark Lord <mlord@pobox.com>
---

--- linux/drivers/ata/libata-sff.c.orig	2007-04-26 12:02:46.000000000 -0400
+++ linux/drivers/ata/libata-sff.c	2007-04-29 08:29:27.000000000 -0400
@@ -413,6 +413,24 @@
 	ap->ops->irq_on(ap);
 }

+static void ata_drain_fifo (struct ata_port *ap, struct ata_queued_cmd *qc)
+{
+	u8 stat = ata_chk_status(ap);
+	/*
+	 * Try to clear stuck DRQ if necessary.
+	 */
+	if ((stat & ATA_DRQ) && (!qc || qc->dma_dir != DMA_TO_DEVICE)) {
+		unsigned int i, limit = 512;
+		printk("Draining up to %u words from data FIFO.\n", limit);
+		for (i = 0; i < limit ; ++i) {
+			ioread16(ap->ioaddr.data_addr);
+			if (!(ata_chk_status(ap) & ATA_DRQ))
+				break;
+		}
+		printk("Drained %u/%u words.\n", i, limit);
+	}
+}
+
 /**
  *	ata_bmdma_drive_eh - Perform EH with given methods for BMDMA controller
  *	@ap: port to handle error for
@@ -469,7 +487,7 @@
 	}

 	ata_altstatus(ap);
-	ata_chk_status(ap);
+	ata_drain_fifo(ap, qc);
 	ap->ops->irq_clear(ap);

 	spin_unlock_irqrestore(ap->lock, flags);
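
The drain loop in the patch above can be tried outside the kernel with a simulated device. The following is only a userspace sketch: struct fake_port, fake_status() and fake_read_data() are hypothetical stand-ins for the port, ata_chk_status() and ioread16(), and the fake device simply keeps DRQ set while words remain in its FIFO.

```c
#include <stdint.h>

#define ATA_DRQ 0x08 /* data request bit in the ATA status register */

/* Hypothetical simulated port: DRQ stays set while words remain. */
struct fake_port {
    unsigned int words_left;
};

/* Stand-in for ata_chk_status(): 0x50 is DRDY|DSC, plus DRQ if data is pending. */
static uint8_t fake_status(const struct fake_port *p)
{
    return p->words_left ? (uint8_t)(0x50 | ATA_DRQ) : (uint8_t)0x50;
}

/* Stand-in for ioread16() on the data register: consume one word. */
static uint16_t fake_read_data(struct fake_port *p)
{
    if (p->words_left)
        p->words_left--;
    return 0; /* the value is discarded by the caller */
}

/*
 * Read and discard words until DRQ clears or the limit is reached.
 * Returns the number of words actually read.
 */
static unsigned int drain_fifo(struct fake_port *p, unsigned int limit)
{
    unsigned int n = 0;

    while (n < limit && (fake_status(p) & ATA_DRQ)) {
        (void)fake_read_data(p);
        n++;
    }
    return n;
}
```

With 3 words pending and a limit of 512, drain_fifo() returns 3 and DRQ is clear afterwards. With 1000 words pending it stops at 512 and DRQ is still set, so the bound keeps error handling from spinning forever on a device that never drops DRQ. Unlike the loop in the patch, which leaves i un-incremented when it breaks out, this sketch counts every word it reads.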
-

To: Mark Lord <liml@...>
Cc: Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>, Jeff Garzik <jgarzik@...>
Date: Thursday, September 27, 2007 - 3:05 am

I think there have been enough cases where this draining was necessary.
IIRC, ata_piix was involved in those cases, right? If so, can you
please submit a patch which applies this only to affected controllers?
I don't feel too confident about applying this to all SFF controllers.

Thanks.

--
tejun

-

To: Tejun Heo <htejun@...>
Cc: Mark Lord <liml@...>, Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>, Jeff Garzik <jgarzik@...>
Date: Thursday, September 27, 2007 - 2:37 pm

Old IDE does it on all controllers bar a couple. So we have a very good
knowledge of what does/doesn't work. The one that needs care in old ide
is an ordering issue where a state machine reset done first causes the
drain of the I/O to hang.
-

To: Alan Cox <alan@...>
Cc: Mark Lord <liml@...>, Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>, Jeff Garzik <jgarzik@...>
Date: Thursday, September 27, 2007 - 7:32 pm

Hmmm... So, do we apply draining to all PATA? Or is ata_piix SATA
affected too?

--
tejun
-

To: Tejun Heo <htejun@...>
Cc: Alan Cox <alan@...>, Mark Lord <liml@...>, Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>, Jeff Garzik <jgarzik@...>
Date: Thursday, September 27, 2007 - 11:52 pm

ata_piix SATA is definitely affected when a PATA_drive to SATA_host bridge is present.
Possibly other times.

Cheers

-

To: Tejun Heo <htejun@...>
Cc: Alan Cox <alan@...>, Mark Lord <liml@...>, Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Thursday, September 27, 2007 - 7:42 pm

I would think all SFF controllers, since a lot of first gen SATA are
really bridged solutions. If they are flagging DRQ, I say oblige them :)

Jeff

-

To: Jeff Garzik <jgarzik@...>
Cc: Alan Cox <alan@...>, Mark Lord <liml@...>, Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Thursday, September 27, 2007 - 7:52 pm

Alright, then the posted patch should be good enough. Mark, can you be
bothered to regenerate the patch and post it one more time (again)? It
seems we all agree the update is needed.

Thanks a lot.

--
tejun
-

To: Tejun Heo <htejun@...>
Cc: Jeff Garzik <jgarzik@...>, Alan Cox <alan@...>, Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Thursday, September 27, 2007 - 11:56 pm

I think this original patch still applies cleanly on at least 2.6.23-rc7.

Drain up to 512 words from host/bridge FIFO on stuck DRQ HSM violation,
rather than just getting stuck there forever.

Signed-Off-By: Mark Lord <mlord@pobox.com>
---

--- old/drivers/ata/libata-sff.c	2007-04-26 12:02:46.000000000 -0400
+++ linux/drivers/ata/libata-sff.c	2007-04-29 08:29:27.000000000 -0400
@@ -413,6 +413,24 @@
 	ap->ops->irq_on(ap);
 }

+static void ata_drain_fifo (struct ata_port *ap, struct ata_queued_cmd *qc)
+{
+	u8 stat = ata_chk_status(ap);
+	/*
+	 * Try to clear stuck DRQ if necessary.
+	 */
+	if ((stat & ATA_DRQ) && (!qc || qc->dma_dir != DMA_TO_DEVICE)) {
+		unsigned int i, limit = 512;
+		printk("Draining up to %u words from data FIFO.\n", limit);
+		for (i = 0; i < limit ; ++i) {
+			ioread16(ap->ioaddr.data_addr);
+			if (!(ata_chk_status(ap) & ATA_DRQ))
+				break;
+		}
+		printk("Drained %u/%u words.\n", i, limit);
+	}
+}
+
 /**
  *	ata_bmdma_drive_eh - Perform EH with given methods for BMDMA controller
  *	@ap: port to handle error for
@@ -469,7 +487,7 @@
 	}

 	ata_altstatus(ap);
-	ata_chk_status(ap);
+	ata_drain_fifo(ap, qc);
 	ap->ops->irq_clear(ap);

 	spin_unlock_irqrestore(ap->lock, flags);
-

To: Mark Lord <liml@...>
Cc: Tejun Heo <htejun@...>, Jeff Garzik <jgarzik@...>, Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Friday, September 28, 2007 - 6:27 am

ap->ops->cleanup();

might be wiser
-

To: Alan Cox <alan@...>
Cc: Mark Lord <liml@...>, Tejun Heo <htejun@...>, Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Friday, September 28, 2007 - 9:05 pm

Though I have queued Mark's patch to be applied, my gut feeling would

If someone needs that, they can override the error handler with their
own. No need for a new op.

Jeff

-

To: Jeff Garzik <jgarzik@...>
Cc: Mark Lord <liml@...>, Tejun Heo <htejun@...>, Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Saturday, September 29, 2007 - 2:28 am

PDC202xx needs
-

To: Alan Cox <alan@...>
Cc: Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Saturday, September 29, 2007 - 8:34 am

Alan, you're the expert there (my condolences!).
Can you generate a fix for the PDC202xx to go with this?

Cheers
-

To: Tejun Heo <htejun@...>
Cc: Alan Cox <alan@...>, Jeff Garzik <jgarzik@...>, Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Friday, September 28, 2007 - 9:41 am

Actually, I believe we should base it on qc->sect_size instead.

Then, if somebody also would like to submit a patch introducing
a cleanup() method, then please do so!

As a separate patch, though (seems to be the "libata way").
* * * *

I think this original patch still applies cleanly on at least 2.6.23-rc7.

Drain up to 512 words from host/bridge FIFO on stuck DRQ HSM violation,
rather than just getting stuck there forever.

Signed-off-by: Mark Lord <mlord@pobox.com>
---

--- old/drivers/ata/libata-sff.c	2007-09-28 09:29:22.000000000 -0400
+++ linux/drivers/ata/libata-sff.c	2007-09-28 09:39:44.000000000 -0400
@@ -420,6 +420,28 @@
 	ap->ops->irq_on(ap);
 }

+static void ata_drain_fifo(struct ata_port *ap, struct ata_queued_cmd *qc)
+{
+	u8 stat = ata_chk_status(ap);
+	/*
+	 * Try to clear stuck DRQ if necessary,
+	 * by reading/discarding up to two sectors worth of data.
+	 */
+	if ((stat & ATA_DRQ) && (!qc || qc->dma_dir != DMA_TO_DEVICE)) {
+		unsigned int i;
+		unsigned int limit = qc ? qc->sect_size : ATA_SECT_SIZE;
+
+		printk(KERN_WARNING "Draining up to %u words from data FIFO.\n",
+			limit);
+		for (i = 0; i < limit ; ++i) {
+			ioread16(ap->ioaddr.data_addr);
+			if (!(ata_chk_status(ap) & ATA_DRQ))
+				break;
+		}
+		printk(KERN_WARNING "Drained %u/%u words.\n", i, limit);
+	}
+}
+
 /**
  *	ata_bmdma_drive_eh - Perform EH with given methods for BMDMA controller
  *	@ap: port to handle error for
@@ -476,7 +498,7 @@
 	}

 	ata_altstatus(ap);
-	ata_chk_status(ap);
+	ata_drain_fifo(ap, qc);
 	ap->ops->irq_clear(ap);

 	spin_unlock_irqrestore(ap->lock, flags);
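
A note on units in the revised patch: qc->sect_size and ATA_SECT_SIZE are byte counts (512 by default), but the loop uses that number as a count of 16-bit reads. The sketch below only spells out that arithmetic; the two helper functions are hypothetical and exist purely for illustration.

```c
#include <stdint.h>

#define ATA_SECT_SIZE 512u /* bytes per sector, as in the kernel's ata.h */

/* Bytes consumed if the drain loop runs to its limit of 16-bit reads. */
static unsigned int drain_limit_bytes(unsigned int limit_words)
{
    return limit_words * (unsigned int)sizeof(uint16_t);
}

/* The same limit expressed in 512-byte sectors. */
static unsigned int drain_limit_sectors(unsigned int limit_words)
{
    return drain_limit_bytes(limit_words) / ATA_SECT_SIZE;
}
```

With limit = 512, the loop can discard up to 1024 bytes, which is where the "two sectors worth of data" in the patch comment comes from.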
-

To: Mark Lord <liml@...>
Cc: Tejun Heo <htejun@...>, Alan Cox <alan@...>, Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Saturday, September 29, 2007 - 2:24 am

applied, after hand-editing out the top of the message, so that it would
not be copied into the kernel changelog

-

To: Mark Lord <liml@...>
Cc: Jeff Garzik <jgarzik@...>, Alan Cox <alan@...>, Andrew Morton <akpm@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Friday, September 28, 2007 - 5:48 am

Acked-by: Tejun Heo <htejun@gmail.com>

--
tejun
-

To: Tejun Heo <htejun@...>
Cc: Mark Lord <liml@...>, Jeff Garzik <jgarzik@...>, Alan Cox <alan@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Friday, September 28, 2007 - 5:56 am

Nacked-by: scripts/checkpatch.pl
-

To: Andrew Morton <akpm@...>
Cc: Mark Lord <liml@...>, Jeff Garzik <jgarzik@...>, Alan Cox <alan@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Friday, September 28, 2007 - 6:01 am

> Nacked-by: scripts/checkpatch.pl

Mark, it seems you'll have to get ACK from this dude first. :-)

--
tejun
-

To: Mark Lord <liml@...>
Cc: <htejun@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Wednesday, September 5, 2007 - 3:38 pm

Hey, we just found something which doesn't crash my Vaio!

sony:/home/akpm/hdparm-7.7> 0 ./hdparm --drq-hsm-error /dev/sda

/dev/sda:
triggering "stuck DRQ" host state machine error
do_drq_hsm_error: Success
ata status=0x58 ata error=0x00

ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/40 tag 0 cdb 0x0 data 0
res 58/00:01:00:00:00/00:00:00:00:00/40 Emask 0x2 (HSM violation)
ata3: soft resetting port
ata3.00: configured for UDMA/100
ata3: EH complete
sd 2:0:0:0: [sda] 195371568 512-byte hardware sectors (100030 MB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

How dull. (ata_piix)
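
For anyone collecting these reports, the "res" line in the log above can be picked apart mechanically. This is a minimal sketch, assuming the libata layout status/error:nsect:lbal:lbam:lbah/hob fields/device followed by Emask; struct ata_result and parse_res_line() are names invented for the example. AC_ERR_HSM is given the value it has in libata (bit 1).

```c
#include <stdio.h>

#define AC_ERR_HSM 0x2u /* host state machine violation bit in Emask */

/* Hypothetical container for the fields of interest. */
struct ata_result {
    unsigned status; /* ATA status register */
    unsigned error;  /* ATA error register */
    unsigned nsect;  /* sector count register */
    unsigned emask;  /* libata error mask */
};

/* Parse a libata "res ..." line. Returns 0 on success, -1 on mismatch. */
static int parse_res_line(const char *line, struct ata_result *r)
{
    unsigned lbal, lbam, lbah, hob[5], dev;
    int n = sscanf(line,
                   " res %x/%x:%x:%x:%x:%x/%x:%x:%x:%x:%x/%x Emask %x",
                   &r->status, &r->error, &r->nsect, &lbal, &lbam, &lbah,
                   &hob[0], &hob[1], &hob[2], &hob[3], &hob[4],
                   &dev, &r->emask);

    return n == 13 ? 0 : -1;
}
```

Fed the line from the log, this gives status 0x58, error 0x00 and Emask 0x2, i.e. AC_ERR_HSM set, which matches the "(HSM violation)" text that libata prints at the end of the line.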
-

To: Andrew Morton <akpm@...>
Cc: <htejun@...>, <michal.k.k.piotrowski@...>, <bryan@...>, <linux-kernel@...>, <linux-ide@...>
Date: Wednesday, September 5, 2007 - 7:03 pm

On my two very similar notebooks, it crashes libata when a PATA drive is used
behind a Marvell converter chip, but not when a SATA drive is used directly.

Cheers
-
