Re: SCSI or libata problem with an RDX removable disk

Previous thread: none

Next thread: mmc/card/block.c : mmc_blk_open readonly mount bug? by sasin on Thursday, September 4, 2008 - 2:17 am. (3 messages)
From: Pascal GREGIS
Date: Thursday, September 4, 2008 - 2:54 am

Hello everyone,

I turn to you with a new problem about scsi or libata.
I have a removable disk of type RDX, connected by SATA to my linux machine.
I access it via libata on /dev/sdb, like a classical disk.
I write to it a certain time with some mounts and umounts.
After a certain number of mounting and unmounting, or maybe a certain number of hours (not clearly identified), the device refuses to be mounted, saying "You must specify the filesystem type.".
Looking to /var/log/messages I have the following logs :


Jul 15 14:55:54 testrdx kernel: sd 0:0:1:0: [sdb] READ CAPACITY failed
Jul 15 14:55:54 testrdx kernel: sd 0:0:1:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Jul 15 14:55:54 testrdx kernel: sd 0:0:1:0: [sdb] Sense not available.
Jul 15 14:55:54 testrdx kernel: sd 0:0:1:0: [sdb] Write Protect is off
Jul 15 14:55:54 testrdx kernel: sd 0:0:1:0: [sdb] Mode Sense: 00 00 00 00
Jul 15 14:55:54 testrdx kernel: sd 0:0:1:0: [sdb] Asking for cache data failed
Jul 15 14:55:54 testrdx kernel: sd 0:0:1:0: [sdb] Assuming drive cache: write through
Jul 15 14:55:54 testrdx kernel: sd 0:0:1:0: [sdb] READ CAPACITY failed
Jul 15 14:55:54 testrdx kernel: sd 0:0:1:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Jul 15 14:55:54 testrdx kernel: sd 0:0:1:0: [sdb] Sense not available.
Jul 15 14:55:54 testrdx kernel: sd 0:0:1:0: [sdb] Write Protect is off
Jul 15 14:55:54 testrdx kernel: sd 0:0:1:0: [sdb] Mode Sense: 00 00 00 00
Jul 15 14:55:54 testrdx kernel: sd 0:0:1:0: [sdb] Asking for cache data failed
Jul 15 14:55:54 testrdx kernel: sd 0:0:1:0: [sdb] Assuming drive cache: write through

These lines are logged again when I run cfdisk or parted or hdparm or anything
on /dev/sdb.

My system is :
linux kernel 2.6.21.1 with some patches :
- libata-start_stop_management (http://bugs.gentoo.org/attachment.cgi?id=118829)

compiled with libata.

Motherboard ICH6 family (id 2651)

Does anyone know this problem or have an idea about what it is and ...
From: Alan Cox
Date: Thursday, September 4, 2008 - 4:34 am

> Looking to /var/log/messages I have the following logs :

I need the logs where it fails not afterwards - the ones showing the
drive error and where the fs gives up. Without that it is hard to see

It would be useful to know if a 2.6.25/2.6.26 kernel without other

Insufficient information to guess right now.
--

From: Pascal GREGIS
Date: Thursday, September 4, 2008 - 6:52 am

Sorry, you're right, I didn't see these logs :
Sep  4 08:03:01 devsni1 kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Sep  4 08:03:01 devsni1 kernel: ata4.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x2a data 131072 out
Sep  4 08:03:01 devsni1 kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 08:03:08 devsni1 kernel: ata4: port is slow to respond, please be patient (Status 0xd0)
Sep  4 08:03:31 devsni1 kernel: ata4: port failed to respond (30 secs, Status 0xd0)
Sep  4 08:03:31 devsni1 kernel: ata4: soft resetting port
Sep  4 08:03:32 devsni1 kernel: ATA: abnormal status 0xD0 on port 0x0001d807
Sep  4 08:03:32 devsni1 last message repeated 4 times
Sep  4 08:06:14 devsni1 kernel: 
Sep  4 08:06:14 devsni1 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
Sep  4 08:06:14 devsni1 kernel: end_request: I/O error, dev sdb, sector 37700080
Sep  4 08:06:14 devsni1 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
Sep  4 08:06:14 devsni1 kernel: end_request: I/O error, dev sdb, sector 37700336
Sep  4 08:06:14 devsni1 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
Sep  4 08:06:14 devsni1 kernel: end_request: I/O error, dev sdb, sector 37700592
Sep  4 08:06:14 devsni1 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
Sep  4 08:06:14 devsni1 kernel: end_request: I/O error, dev sdb, sector 37700848
... and so on with always different sector numbers.

What is strange is that before this, I have logs looking like :

Sep  4 07:27:32 devsni1 kernel: kjournald starting.  Commit interval 5 seconds
Sep  4 07:27:32 devsni1 kernel: EXT3 FS on sdb, internal journal
Sep  4 07:27:32 devsni1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Sep  4 07:27:32 devsni1 kernel: kjournald starting.  Commit interval 5 seconds
Sep  4 07:27:32 devsni1 kernel: EXT3 FS on sdb, internal journal
Sep  4 07:27:32 devsni1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Sep  4 07:28:20 devsni1 kernel: ...
From: Alan Cox
Date: Monday, September 8, 2008 - 3:21 am

Your disk went offline and then refused to come back when the link was
reset. The initial trigger appears to have been the drive, the fact it
didn't come back could either be the drive or a controller problem. We've
seen a few cases where devices or controllers fail to recover from one
end being stuck expecting data.

Mark Lord did some patches to try and drain data in this case but I don't
remember if they were merged yet.

Alan
--

From: Mark Lord
Date: Monday, September 8, 2008 - 11:58 am

..

That would be this patch, currently not merged, not maintained,
and probably needs rework for some chipsets.  But for the record:



I think this original patch still applies cleanly on at least 2.6.23-rc7.

Drain up to 512 words from host/bridge FIFO on stuck DRQ HSM violation,
rather than just getting stuck there forever.

Signed-off-by: Mark Lord <mlord@pobox.com>
---

--- old/drivers/ata/libata-sff.c	2007-09-28 09:29:22.000000000 -0400
+++ linux/drivers/ata/libata-sff.c	2007-09-28 09:39:44.000000000 -0400
@@ -420,6 +420,28 @@
 	ap->ops->irq_on(ap);
 }
 
+static void ata_drain_fifo(struct ata_port *ap, struct ata_queued_cmd *qc)
+{
+	u8 stat = ata_chk_status(ap);
+	/*
+	 * Try to clear stuck DRQ if necessary,
+	 * by reading/discarding up to two sectors worth of data.
+	 */
+	if ((stat & ATA_DRQ) && (!qc || qc->dma_dir != DMA_TO_DEVICE)) {
+		unsigned int i;
+		unsigned int limit = qc ? qc->sect_size : ATA_SECT_SIZE;
+
+		printk(KERN_WARNING "Draining up to %u words from data FIFO.\n",
+									limit);
+		for (i = 0; i < limit ; ++i) {
+			ioread16(ap->ioaddr.data_addr);
+			if (!(ata_chk_status(ap) & ATA_DRQ))
+				break;
+		}
+		printk(KERN_WARNING "Drained %u/%u words.\n", i, limit);
+	}
+}
+
 /**
  *	ata_bmdma_drive_eh - Perform EH with given methods for BMDMA controller
  *	@ap: port to handle error for
@@ -476,7 +498,7 @@
 	}
 
 	ata_altstatus(ap);
-	ata_chk_status(ap);
+	ata_drain_fifo(ap, qc);
 	ap->ops->irq_clear(ap);
 
 	spin_unlock_irqrestore(ap->lock, flags);
--

From: Pascal GREGIS
Date: Wednesday, September 10, 2008 - 1:42 am

Hi Mark and Alan,

Thank you for your answers and for the patch Mark sent in his mail.
I had found just before this patch :
http://kerneltrap.org/mailarchive/linux-kernel/2007/9/27/324334
which seems to be different from the one you sent.

What does cuase this difference, do these two patches apply on different kernel versions or do they correspond to different revisions of the patch, one being fixed compared to the other?

Thank you

Pascal


--

From: Pascal GREGIS
Date: Monday, September 8, 2008 - 1:19 am

Hi everyone,

I posted this problem last week on this mailing list, I got an answer from Alan Cox requiring more informations.
Then when I gave those informations, I didn't get any other answer.
So I try another time to get help from some of you.

Here is my problem :
I have a Linux box with an RDX removable disk in SATA. A software uses regularly this RDX, mounts it, reads and/or writes to it and unmounts it.
But after a certain time or a certain number of uses (not clearly identified), the device fails to respond, mount displaying something like :
"There is no filesystem on this device"

In /var/log/messages I have :
Sep  4 08:03:01 devsni1 kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Sep  4 08:03:01 devsni1 kernel: ata4.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x2a data 131072 out
Sep  4 08:03:01 devsni1 kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 08:03:08 devsni1 kernel: ata4: port is slow to respond, please be patient (Status 0xd0)
Sep  4 08:03:31 devsni1 kernel: ata4: port failed to respond (30 secs, Status 0xd0)
Sep  4 08:03:31 devsni1 kernel: ata4: soft resetting port
Sep  4 08:03:32 devsni1 kernel: ATA: abnormal status 0xD0 on port 0x0001d807
Sep  4 08:03:32 devsni1 last message repeated 4 times
Sep  4 08:06:14 devsni1 kernel: 
Sep  4 08:06:14 devsni1 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
Sep  4 08:06:14 devsni1 kernel: end_request: I/O error, dev sdb, sector 37700080
Sep  4 08:06:14 devsni1 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
Sep  4 08:06:14 devsni1 kernel: end_request: I/O error, dev sdb, sector 37700336
Sep  4 08:06:14 devsni1 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
Sep  4 08:06:14 devsni1 kernel: end_request: I/O error, dev sdb, sector 37700592
Sep  4 08:06:14 devsni1 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
Sep  4 08:06:14 devsni1 kernel: end_request: I/O error, dev sdb, sector 37700848
... and so on with always ...
Previous thread: none

Next thread: mmc/card/block.c : mmc_blk_open readonly mount bug? by sasin on Thursday, September 4, 2008 - 2:17 am. (3 messages)