Re: Question on siig sata 3 controller

Previous thread: [PATCH 10/12] scsi: megaraid_sas - Add input parameter for max_sectors by Yang, Bo on Wednesday, June 9, 2010 - 9:19 pm. (5 messages)

Next thread: [PATCH] of/device: Move struct of_device define outside of CONFIG_OF_DEVICE test by Grant Likely on Wednesday, June 9, 2010 - 9:54 pm. (2 messages)
From: Alan
Date: Wednesday, June 9, 2010 - 9:39 pm

Does anyone know the status of the SIIG DP SATA 6Gb/s 2S1P PCIe (Part
number: SC-SA0E12-S1)?

I am encountering problems writing a large quantity of data through this
controller and I want to see if there is a way to fix this.  The PCI IDs
do not appear to be referenced in the kernel.

Are any of the siig sata controllers supported? Is there some issue with
them supporting Linux that I am not aware of?

Here is the lspci data:

05:00.0 SATA controller: Device 1b4b:9123 (rev 11) (prog-if 01 [AHCI 1.0])
	Subsystem: Device 1b4b:9123
	Flags: bus master, fast devsel, latency 0, IRQ 30
	I/O ports at dc00 [size=8]
	I/O ports at d880 [size=4]
	I/O ports at d800 [size=8]
	I/O ports at d480 [size=4]
	I/O ports at d400 [size=16]
	Memory at f9fff800 (32-bit, non-prefetchable) [size=2K]
	Expansion ROM at f9fe0000 [disabled] [size=64K]
	Capabilities: <access denied>
	Kernel driver in use: ahci

05:00.1 IDE interface: Device 1b4b:91a4 (rev 11) (prog-if 8f [Master SecP SecO PriP PriO])
	Subsystem: Device 1b4b:91a4
	Flags: fast devsel, IRQ 18
	I/O ports at d080 [size=8]
	I/O ports at d000 [size=4]
	I/O ports at cc00 [size=8]
	I/O ports at c880 [size=4]
	I/O ports at c800 [size=16]
	Memory at f9fff400 (32-bit, non-prefetchable) [size=16]
	Expansion ROM at f9fd0000 [disabled] [size=64K]
	Capabilities: <access denied>
	Kernel modules: ata_generic, pata_acpi

Thanks!

--

From: Jeff Garzik
Date: Thursday, June 10, 2010 - 1:53 am

What issues are you seeing?

The 'ahci' driver is aware of this controller...

	Jeff



--

From: alan
Date: Thursday, June 10, 2010 - 9:28 am

If you write a large amount of data to the drive (about 6-8 gigs+) the 
drive will error out and disconnect.

I will post the string of error messages when I get home.

-- 
Truth is stranger than fiction because fiction has to make sense.
--

From: Alan
Date: Thursday, June 10, 2010 - 7:08 pm

When writing large amounts of data I see messages like the following:

Jun  8 19:31:46 zowie kernel: ata2.00: exception Emask 0x0 SAct 0x3fffffff SErr 0x0 action 0x6 frozen
Jun  8 19:31:46 zowie kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun  8 19:31:46 zowie kernel: ata2.00: cmd 61/28:00:17:fb:06/00:00:04:00:00/40 tag 0 ncq 20480 out
Jun  8 19:31:46 zowie kernel:         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun  8 19:31:46 zowie kernel: ata2.00: status: { DRDY }
Jun  8 19:31:46 zowie kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun  8 19:31:46 zowie kernel: ata2.00: cmd 61/20:08:9f:db:06/00:00:04:00:00/40 tag 1 ncq 16384 out
Jun  8 19:31:46 zowie kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun  8 19:31:46 zowie kernel: ata2.00: status: { DRDY }
Jun  8 19:31:46 zowie kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun  8 19:31:46 zowie kernel: ata2.00: cmd 61/28:10:d7:df:06/00:00:04:00:00/40 tag 2 ncq 20480 out
Jun  8 19:31:46 zowie kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun  8 19:31:46 zowie kernel: ata2.00: status: { DRDY }
Jun  8 19:31:46 zowie kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun  8 19:31:46 zowie kernel: ata2.00: cmd 61/30:18:0f:e4:06/00:00:04:00:00/40 tag 3 ncq 24576 out
Jun  8 19:31:46 zowie kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun  8 19:31:46 zowie kernel: ata2.00: status: { DRDY }
Jun  8 19:31:46 zowie kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun  8 19:31:46 zowie kernel: ata2.00: cmd 61/28:20:17:fc:06/00:00:04:00:00/40 tag 4 ncq 20480 out
Jun  8 19:31:46 zowie kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun  8 19:31:46 zowie kernel: ata2.00: status: { DRDY }
Jun  8 19:31:46 zowie kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun  8 19:31:46 zowie kernel: ata2.00: cmd 61/08:28:b7:b7:06/00:00:04:00:00/40 tag 5 ncq 4096 out
Jun  8 19:31:46 zowie ...
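
For anyone reading these logs, the "cmd" strings can be decoded by hand. The sketch below is only an illustration and assumes the field order libata is understood to print (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and 512-byte sectors. For NCQ commands the sector count travels in the feature fields and the tag in the upper bits of nsect. The function and field names are made up for this example.

```python
import re

# Assumed order of the twelve hex bytes in a libata "cmd" string.
FIELDS = ("command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
          "device")


def decode_ncq_taskfile(text, sector_size=512):
    """Decode 'cc/ff:nn:ll:mm:hh/ff:nn:ll:mm:hh/dd' for an NCQ read/write."""
    parts = re.split(r"[/:]", text.strip())
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d hex bytes, got %d" % (len(FIELDS), len(parts)))
    tf = dict(zip(FIELDS, (int(p, 16) for p in parts)))

    # NCQ: sector count is split across feature (low) and hob_feature (high).
    sectors = (tf["hob_feature"] << 8) | tf["feature"]
    # 48-bit LBA, most significant byte first.
    lba = (tf["hob_lbah"] << 40 | tf["hob_lbam"] << 32 | tf["hob_lbal"] << 24
           | tf["lbah"] << 16 | tf["lbam"] << 8 | tf["lbal"])
    return {
        "opcode": tf["command"],   # 0x60 = READ FPDMA QUEUED, 0x61 = WRITE FPDMA QUEUED
        "tag": tf["nsect"] >> 3,   # NCQ tag lives in bits 7:3 of nsect
        "sectors": sectors,
        "bytes": sectors * sector_size,
        "lba": lba,
    }


if __name__ == "__main__":
    print(decode_ncq_taskfile("61/28:00:17:fb:06/00:00:04:00:00/40"))
```

Decoding the first failed command this way gives tag 0 and 40 sectors (20480 bytes), which agrees with the "tag 0 ncq 20480 out" text the kernel printed next to it.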
From: Rogier Wolff
Date: Monday, June 14, 2010 - 11:57 pm

Yeah! I'm trying to write some 2.5 TB to my raid array, where 2 of 8
disks are connected to an Asus U3S6 board.
   http://www.asus.com/product.aspx?P_ID=lGYmelQ8mJvPtYTv

After a while, those two disks bomb out, and make the raid
inaccessible.

A reboot brings the disks back to life. So in theory, Linux should be
able to restore life into these drives by doing the right magic with
the hardware bits... 

I'm running 2.6.34: 

Linux version 2.6.34 (root@zebigbos) (gcc version 3.4.2) #3 SMP Mon May 17 21:04:13 CEST 2010


Log file entries: 

ata5.00: exception Emask 0x0 SAct 0xfff SErr 0x0 action 0x6 frozen
ata5.00: failed command: READ FPDMA QUEUED
ata5.00: cmd 60/a8:00:f6:12:10/00:00:0d:00:00/40 tag 0 ncq 86016 in
         res 40/00:14:ee:98:bb/00:00:0a:00:00/40 Emask 0x4 (timeout)
ata5.00: status: { DRDY }
...
ata5.00: failed command: READ FPDMA QUEUED
ata5.00: cmd 60/a0:58:ee:19:10/00:00:0d:00:00/40 tag 11 ncq 81920 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 370)
ata5.00: configured for UDMA/133
ata5.00: device reported invalid CHS sector 0
*last message repeated 10 times
ata5: EH complete

(all tags 1...10 are also listed.)
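
The SAct value in the exception line is a bitmask with one bit per outstanding NCQ tag, so it shows how many commands were in flight when the port froze. A minimal sketch of reading it (the helper name is made up for illustration):

```python
def active_tags(sact):
    """Return the NCQ tags whose bits are set in an SActive (SAct) value."""
    return [tag for tag in range(32) if (sact >> tag) & 1]


if __name__ == "__main__":
    for value in (0xfff, 0x3fffffff):
        tags = active_tags(value)
        print("SAct 0x%x: %d tags, %d..%d" % (value, len(tags), tags[0], tags[-1]))
```

SAct 0xfff therefore corresponds to twelve outstanding commands, tags 0 through 11, which fits the tag 0 and tag 11 entries quoted above. The 0x3fffffff value from the earlier log corresponds to thirty outstanding commands, tags 0 through 29.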

This seems "harmless"; it happened a few times in the last hour or so
(during the rebuild). 

When things went bad last time I got: 

one of these "harmless events" (but this time with 31 tags listed!): 

Jun 14 18:26:23 vercingetorix kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 370)

and then 5 seconds later: 

ata5.00: qc timeout (cmd 0xec)
ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata5.00: revalidation failed (errno=-5)
ata5: hard resetting link
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 370)
ata5.00: qc timeout (cmd 0xec)
ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)


	Roger. 

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 ...
From: Alan Cox
Date: Tuesday, June 15, 2010 - 3:07 am

We don't have power control of the drives. If the firmware crashes or a
drive flakes out due to power problems or something similar occurs its

We tried the biggest hammer we had

Alan
--

From: Rogier Wolff
Date: Tuesday, June 15, 2010 - 7:53 am

The thing is, the power didn't cycle. I just typed "reboot" from a
remote location. (Yes, in most cases leading up to yesterday's/this
morning's event I thought I had to powercycle to bring them back, but
I tried "just the reboot" this morning and it worked!)

The controller has TWO drives connected. BOTH drives became
inaccessible at exactly the same point in time. This has happened
before, with BOTH drives disappearing at the same moment.

The RAID superblocks on BOTH drives had info like: 
      RAID disk 1/8, raid is up 8/8
say for disk numbers 1,2. 

All six other drives had
      RAID disk 4/8, raid is broken 6/8
say, for disk numbers 0, 3,4,5,6,7

Next time this happens, I'll try removing and reinserting all the sata
modules (the machine is a file-server. It's NFS-root so it doesn't
depend on the storage modules for its root fs.... :-) )

sata_nv                20758  0 
ahci                   36037  6 

Is one of these modules the driver for this controller? I think it's
AHCI: lshw says it uses ports cc00 ... and a bunch of others, and
those ports are claimed by ahci according to /proc/ioports. Ah! I need

Not big enough! The BIOS manages a bigger one!

	Roger.

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
**    Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233    **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement. 
Does it sit on the couch all day? Is it unemployed? Please be specific! 
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
--

From: Alan Cox
Date: Tuesday, June 15, 2010 - 8:01 am

AHCI will be driving it.
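
The /proc/ioports check mentioned above can be scripted. The sketch below parses text in the /proc/ioports format and lists every entry whose range covers a given port, outermost first, so the last entry is normally the driver. The sample text is hypothetical, not taken from either machine in this thread, and the function name is made up. On recent kernels the real file may show zeroed addresses unless read as root.

```python
def port_owners(ioports_text, port):
    """Return names of all /proc/ioports entries covering `port`, outermost first."""
    owners = []
    for line in ioports_text.splitlines():
        rng, sep, name = line.strip().partition(" : ")
        if not sep:
            continue  # skip blank or malformed lines
        start, _, end = rng.partition("-")
        if int(start, 16) <= port <= int(end, 16):
            owners.append(name)
    return owners


# Hypothetical sample in the /proc/ioports layout (nesting shown by indentation).
SAMPLE = """\
0000-0cf7 : PCI Bus 0000:00
c000-cfff : PCI Bus 0000:05
  cc00-cc07 : 0000:05:00.0
    cc00-cc07 : ahci
"""

if __name__ == "__main__":
    print(port_owners(SAMPLE, 0xcc00))
    # On a live system: port_owners(open("/proc/ioports").read(), 0xcc00)
```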
--

From: Alan
Date: Tuesday, June 15, 2010 - 10:25 am

I have seen this problem with the 2.6.33 kernel in Fedora 13. The
problem goes away in 2.6.35-rc3. (Though networking is fubared for me on
that kernel, so I have not migrated to it.)

My understanding is the "fix" in the driver was to blacklist ncq for
that controller. I have not verified that yet.
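
If NCQ really is the trigger (not verified here either), one way to test that on an affected kernel is to drop the drive's queue depth to 1 through sysfs, which stops libata from issuing queued commands to it. The sketch below assumes the usual /sys/block/<disk>/device/queue_depth attribute, needs root to write, and uses made-up function names. "sdb" in the usage comment is only a placeholder. Booting with libata.force=noncq is, as far as I know, the equivalent kernel command-line switch.

```python
from pathlib import Path


def queue_depth_path(disk, sysfs_root="/sys"):
    """Path of the queue_depth attribute for a disk such as 'sdb'."""
    return Path(sysfs_root) / "block" / disk / "device" / "queue_depth"


def read_queue_depth(disk, sysfs_root="/sys"):
    """Current queue depth; 1 means no NCQ commands are being queued."""
    return int(queue_depth_path(disk, sysfs_root).read_text().strip())


def set_queue_depth(disk, depth, sysfs_root="/sys"):
    """Write a new queue depth (requires root on a real system)."""
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    queue_depth_path(disk, sysfs_root).write_text("%d\n" % depth)


# Example (as root, placeholder disk name):
#   print(read_queue_depth("sdb"))
#   set_queue_depth("sdb", 1)
```

The setting does not survive a reboot, so it is only useful for checking whether the timeouts stop.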

--

From: Rogier Wolff
Date: Tuesday, August 10, 2010 - 6:48 am

One of my disks died again a while ago. So I went to the machine to
replace the drive. But I forgot to write down which one had died. So I
started it up again. Now I had 7 disks again like before, but a
different drive was now "gone". So my RAID had only 6 out of 8 drives
and was "gone". Together with some 4.7T worth of data on it.... 

Next I went to the machine with a spare sata card. I removed the
drives from the ASUS U3S6 card, and put them on the old pci sata card. 

By the time I logged in on the machine, the RAID had found 8/8 drives
and I think it had already started rebuilding..... 

I now haven't had any problems with the drives in more than a week.

Performance of the raid has dropped from 600 MB/s to around 400 MB/s,
obviously because the PCI card cannot handle 200 MB/s of disk IO. 

I'm open to suggestions for cheap high-performance WORKING PCIe sata
cards.... 

	Roger. 

--

From: alan
Date: Wednesday, August 11, 2010 - 5:27 pm

I found that if I ran the latest of Linus' kernels, the controller worked 
correctly.  There is obviously a change that needs to get backported into 
the other working kernels.

-- 
Truth is stranger than fiction because fiction has to make sense.
--
