Re: sata_nv issues with MCP51 SATA controller

To: <linux-kernel@...>
Date: Thursday, September 13, 2007 - 3:46 am

Hi, I'm resending (I didn't see my first attempt appear on the mailing list):

I'm having serious disk issues when using the on-board nvidia controller
for my HDDs. (My motherboard is a Gigabyte GA-N650SLI-DS4 with nvidia
chipset; the CPU is an Intel Core 2 Quad.)

excerpt from "lspci":
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev a1)
00:0e.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:0f.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)

I have a normal IDE/P-ATA-disk attached to the "IDE"-controller and that
works fine (/dev/hda)

However, any number of disks (I have tried 2 and 4) connected to the
SATA-controller(s), will eventually fail. - See attached log (excerpt /
anything relevant from /var/log/messages)

At first, disks were REALLY unstable, but then I disabled S.M.A.R.T.
(both in BIOS and Linux), and I updated from the CentOS5 (equivalent of
RHEL5) kernel (2.6.18) to the latest (at that time) official kernel from
kernel.org:

> uname -a
Linux mirakel 2.6.22.5-custom_jir #2 SMP Thu Aug 30 22:06:21 CEST 2007
i686 i686 i386 GNU/Linux

Now it will normally take a day or two before SATA crashes, so things
are better, but still rather useless.

The first error when sata_nv gets into trouble is always:
"exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen"
(as shown in the attached log-file.) - when this happens to one device,
it'll almost instantly happen to the other disk attached to that
controller as well. A couple of minutes (or so) later, the disk(s)
connected to the other controller will start acting up as well (in the
same manner). - I/O freezes, and nothing helps except a reboot...
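
For anyone triaging a similar log, here is a minimal Python sketch that finds
the first "frozen" exception per port, to see which channel went down first.
The message text is the one quoted above; the leading "ataN" port tag and the
helper name are assumptions, not something taken from the attached log.

```python
import re

# Matches the exception message quoted above. The "ataN" / "ataN.MM" prefix
# is an assumption about how the port is tagged on each log line.
FROZEN = re.compile(
    r"(?P<port>ata\d+)(?:\.\d+)?: exception Emask 0x[0-9a-f]+ .* frozen"
)


def first_frozen_per_port(lines):
    """Map each port to the index of its first 'frozen' exception line.

    Ports appear in the order in which they first failed.
    """
    seen = {}
    for index, line in enumerate(lines):
        match = FROZEN.search(line)
        if match and match.group("port") not in seen:
            seen[match.group("port")] = index
    return seen
```

Feeding it the lines of /var/log/messages, for example
first_frozen_per_port(open("/var/log/messages")), would show whether the
second controller's ports always follow the first one's.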

As I run a rather large (software / md) RAID-5 disk array on this server
(I'm doing a bit of video editing), every crash means a time-consuming
rebuild of the disk-array...

I have given up on the sata_nv / nvidia-controllers for the time being.
I now resort to some old PCI-connected sata-controllers which work fine
(but...

To: Jon Ivar Rykkelid <jonry@...>
Cc: <linux-kernel@...>
Date: Thursday, September 13, 2007 - 10:20 am

does adma=0 module option do anything?

Jeff

-

To: Jeff Garzik <jeff@...>
Cc: <linux-kernel@...>, Tejun Heo <htejun@...>
Date: Thursday, September 13, 2007 - 11:05 am

Thanks for the suggestion, but sata_nv is not built modular in my
current kernel, so "no can do" at the moment
(However, if some expert REALLY thinks this will fix things, I will
CERTAINLY recompile and give it a go)

As I said before, it all works for some time (a day or two) before it
crashes with the current kernel & no "S.M.A.R.T.". With my current setup
I have always had the time to fully rebuild my disk-array before a new
crash. - In the case of 4 disks attached to the nvidia controllers
(disregarding the disks on other controllers), this means that the
sata_nv-driver / controllers alone have read at least 750GB and written
250GB of data before the crash (with no resets working) - soft reboot
fixes everything. - I'm pretty confident that this is a driver issue.

As Tejun Heo <htejun@gmail.com> writes "the whole controller seems to
have went down at once and it's not even IRQ routing problem - resets
are failing."

The error-messages / crash-symptoms were the same with SMART enabled and
the original CentOS5-kernel, except that with that setup, the crashes
were much more frequent.

Any help?

BR
Jon Ivar

-

To: Jon Ivar Rykkelid <jonry@...>
Cc: Jeff Garzik <jeff@...>, <linux-kernel@...>
Date: Thursday, September 13, 2007 - 11:14 am

Passing "sata_nv.adma=0" as kernel boot parameter will do the trick.

--
tejun
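
A small sketch, assuming Python is available on the box, for checking that a
"module.param=value" token such as the one above really made it onto the
kernel command line. The function name is hypothetical; it only parses the
string and says nothing about whether the driver honours the option.

```python
def module_params_from_cmdline(cmdline, module):
    """Collect 'module.param=value' tokens from a kernel command line string."""
    prefix = module + "."
    params = {}
    for token in cmdline.split():
        # Only tokens of the form "<module>.<param>=<value>" are of interest.
        if token.startswith(prefix) and "=" in token:
            name, value = token[len(prefix):].split("=", 1)
            params[name] = value
    return params
```

After a reboot, module_params_from_cmdline(open("/proc/cmdline").read(),
"sata_nv") should contain an "adma" entry if the boot loader passed the option.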
-

To: Tejun Heo <htejun@...>
Cc: Jeff Garzik <jeff@...>, <linux-kernel@...>, Robert Hancock <hancockr@...>
Date: Thursday, September 13, 2007 - 2:01 pm

Resending, as my first attempts contained HTML and were blocked...

Ahh, silly me... Of course!
Ooops, I just got back, and verified: I actually have sata_nv running as
a module after all on this server... My bad.
I fixed /etc/modprobe.conf to include the following two lines:
"
alias scsi_hostadapter sata_nv
options sata_nv adma=0
...
"

I then ran "mkinitrd" (to ensure that the latest options from
modprobe.conf were included in the initrd-image that I load at boot).

- Do you guys think this is worth a try? Anyway, I have rebooted now, so
I'll test it for a few days and let you know - We'll just have to wait
and see...
Do you think I should re-enable SMART to provoke a failure, or would
that be tempting fate too much? (For now I have not re-enabled SMART)

PS: Is there any way of testing / verifying that sata_nv is now running
with this option? - I am pretty sure I have done it correctly, but I
would still like to confirm that the proper option has been passed if
possible.

Thanks
Jon Ivar
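
On the PS above: one possible check, sketched in Python. It assumes the driver
registers the parameter with read permission, in which case the kernel exposes
it under /sys/module/<module>/parameters/; boolean parameters typically read
back as Y or N. The helper name and the sysfs_root argument are illustrative.

```python
from pathlib import Path


def read_module_param(module, param, sysfs_root="/sys/module"):
    """Return a loaded module's parameter value as text, or None if absent.

    None means the file could not be read: the module is not loaded, or the
    parameter is not exposed through sysfs.
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None
```

Calling read_module_param("sata_nv", "adma") on the running system would then
show the value the driver is actually using, independent of what is written in
modprobe.conf.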

-

To: Jon Ivar Rykkelid <jonry@...>
Cc: Tejun Heo <htejun@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, Robert Hancock <hancockr@...>
Date: Friday, September 14, 2007 - 9:29 am

I don't think it will matter, as adma doesn't affect MCP51, but only nforce4.
So I'd look for other trouble makers.
--
(°= =°)
//\ Prakash Punnoor /\\
V_/ \_V

To: Prakash Punnoor <prakash@...>
Cc: Tejun Heo <htejun@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, Robert Hancock <hancockr@...>
Date: Friday, September 14, 2007 - 10:17 am

Robert told me. (And you're correct - It didn't help).

I'm going to test another (identical) motherboard this evening to
establish whether it could be a HW-issue.

I'll keep you posted

Jon Ivar

-

To: Robert Hancock <hancockr@...>
Cc: Prakash Punnoor <prakash@...>, Tejun Heo <htejun@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>
Date: Friday, September 14, 2007 - 4:35 pm

Hi, I'm getting more confident that the driver is the issue.

I have now been able to reproduce the same error on the new motherboard
as well... - (the same MB was tested to work in Windows with
windows-drivers)...

Unless you guys can come up with something clever, I'll see if I can get
my hands on / change to another (non-nvidia) chipset in a day or two, as
the sata_nv with this chipset apparently isn't working.

(Or has anyone EVER been successful with the latest kernel/driver on
this HW?)

Attaching everything relevant from /var/log/messages...

Jon Ivar

To: Jon Ivar Rykkelid <jonry@...>
Cc: Robert Hancock <hancockr@...>, Tejun Heo <htejun@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>
Date: Saturday, September 15, 2007 - 3:12 am

I don't have exactly the same hw, but the same chipset and I don't have any
problems - even with the swncq patch applied. Do you have an hpet? If not,
try booting with acpi_use_timer_override. My system won't work with skipping
the override.

--
(°= =°)
//\ Prakash Punnoor /\\
V_/ \_V

To: <linux-kernel@...>
Cc: Prakash Punnoor <prakash@...>, Robert Hancock <hancockr@...>, Tejun Heo <htejun@...>, Jeff Garzik <jeff@...>
Date: Saturday, September 15, 2007 - 6:14 am

Hi, I reconnected and rebooted with the kernel option
"acpi_use_timer_override" (this is the correct spelling, isn't it? -
Kernel didn't complain.). It didn't help; I got the same error as
before. - I'll have to connect all disks back to my PCI-connected SATA
controllers and start rebuilding my RAID yet again.

It seems random which disk is affected first. (So far, I know that it
has happened to ata1, ata3 and ata4, three of my potential disks. I
guess it just happens to whichever disk is being used at the moment
the driver / controller acts up.)

I'm about to give in. I think I'll try to replace both ( Gigabyte
GA-N650SLI-DS4 ) motherboards, as the driver simply isn't working for
the on-board controller of these boards. Could be a combination of the
controllers and some other HW on the motherboards of course, but all is
working when I connect all disks to my non-nvidia controllers. - Guess
I'll opt for a motherboard with an intel-chipset after all...

BR
Jon Ivar
-

To: Jon Ivar Rykkelid <jonry@...>
Cc: <linux-kernel@...>, Prakash Punnoor <prakash@...>, Robert Hancock <hancockr@...>, Tejun Heo <htejun@...>, Jeff Garzik <jeff@...>
Date: Saturday, September 15, 2007 - 10:47 am

>>>>> "Jon" == Jon Ivar Rykkelid <jonry@pvv.org> writes:

Jon> Hi , I reconnected and rebooted with the kernel option
Jon> "acpi_use_timer_override" (this is the correct spelling, isn't
Jon> it? - Kernel didn't complain.). Didn't help, the same error
Jon> received as before. - I'll have to connect all disks back to my
Jon> PCI-connected SATA controllers and start rebuilding my RAID yet
Jon> again.

What happens when you just have ONE disk connected to the motherboard
controller, and the rest connected to PCI controllers? Does it crap
out then? You've just such a nice repeatable problem across
motherboards that it's a shame to waste this debugging time.

I'm wondering if it's a PCI bus issue somehow, and that the load on
the motherboard controller isn't supportable when you have a bunch of
disks on PCI controllers as well. Shot in the dark...

Thanks for all your hard work on this, I know how frustrating it is to
not have a stable system!

John
-

To: <linux-kernel@...>
Cc: John Stoffel <john@...>, Prakash Punnoor <prakash@...>, Robert Hancock <hancockr@...>, Tejun Heo <htejun@...>, Jeff Garzik <jeff@...>
Date: Saturday, September 15, 2007 - 3:29 pm

Sorry, I gave in. I have now abandoned my nvidia trials (both
motherboards have been returned, and I'm now running with Intel chipset)
- My current motherboard is less ideal (in terms of PCI-slots etc.), but
That was actually not such a bad idea... Unfortunately it's too late now
(If not I should have tested for sure). I was/am after all running an
8-disk SATA array (plus a normal IDE disk - not in the raid). I had 4
disks running through two PCI-cards and 4 disks used the motherboard's
controller. - When all 8 disks were connected to the two PCI-cards the
speed dropped compared to when the motherboard's controller took some
load.. (So it could maybe be an issue with bandwidth / load ? - I don't
Sorry for giving in, but I felt I was banging my head against the wall
(and with too few sensible solutions being suggested). Now I guess I'm
semi-happy that all seems to work OK with the Intel chipset..
Frustrating that the sata_nv-driver / nvidia HW didn't work with my
configuration, though...

Thank you all for your effort as well - hope someone figures this out
sometime in the future.

All the best
Jon Ivar
-

To: Jon Ivar Rykkelid <jonry@...>
Cc: Prakash Punnoor <prakash@...>, Tejun Heo <htejun@...>, <linux-kernel@...>, Robert Hancock <hancockr@...>
Date: Friday, September 14, 2007 - 10:25 am

Not just motherboard. It is more likely to be a cable, drive or PSU
problem.

Jeff

-

To: Jeff Garzik <jeff@...>
Cc: Jon Ivar Rykkelid <jonry@...>, Prakash Punnoor <prakash@...>, <linux-kernel@...>, Robert Hancock <hancockr@...>
Date: Friday, September 14, 2007 - 10:39 am

I don't think it's cable as the problem occurs on multiple ports. My
bet is either the controller or PSU.

Thanks.

--
tejun
-

To: Tejun Heo <htejun@...>
Cc: Jeff Garzik <jeff@...>, <linux-kernel@...>, Robert Hancock <hancockr@...>
Date: Thursday, September 13, 2007 - 3:26 pm

Hi,

I now tested with the adma=0 option, but if anything I got a crash
quicker than before. Same error message started coming in, but this time
the system hung before I was able to capture the log as well (but I saw
the error, and it was the same as before, except that this time it was
the ata3-channel that first started acting up..) - To remind you all
what this is about, I have reattached the log that I originally captured...

Any help / clever suggestions are appreciated.

Jon Ivar

To: Jon Ivar Rykkelid <jonry@...>
Cc: Tejun Heo <htejun@...>, <linux-kernel@...>, Robert Hancock <hancockr@...>
Date: Thursday, September 13, 2007 - 3:54 pm

Sounds like a hardware problem, since disabling ADMA is generally the
cure-all we use -- it appears to stress the hardware less.

Jeff

-

To: Jeff Garzik <jeff@...>
Cc: Jon Ivar Rykkelid <jonry@...>, Tejun Heo <htejun@...>, <linux-kernel@...>
Date: Thursday, September 13, 2007 - 8:37 pm

If this is an MCP51 chipset, adma=0 will make no difference since that
chipset does not support ADMA in the first place.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/

-

To: Robert Hancock <hancockr@...>
Cc: Jeff Garzik <jeff@...>, Tejun Heo <htejun@...>, <linux-kernel@...>
Date: Friday, September 14, 2007 - 8:10 am

Hi,

To eliminate the possibility of this being a hardware issue, I have now
acquired another "Gigabyte GA-N650SLI-DS4" motherboard (with the "MCP51"
chipset) for testing. I'll swap parts this evening. Hopefully I'll be
able to tell you in a few hours whether this appears to be working as it
should. The motherboard that I'm going to swap to has actually been
tested (with MS Windows OS+driver) for more than a day with a disk
connected, so if this MB also fails, I think it will be safe to say that
the issue is with the sata_nv driver... So hang on.

(You can't think of something else that could conflict with the sata_nv
driver after a bit of time, like two of my raid-disks being encrypted,
me running a SW raid-5 array / some special HW (quad-core CPU) / me
running vmware on this server ... ? - To me, all these suggestions seem
rather far-fetched, especially as all is working with another
controller, so I'm arguing that unless there's a HW issue, the issue is
with the driver, but you're the expert(s), so let me know if you disagree.)

I'll keep you posted as to the result of swapping HW.. Give me a few
hours. :-)

BR
Jon Ivar

-

To: Jeff Garzik <jeff@...>, Tejun Heo <htejun@...>, Robert Hancock <hancockr@...>
Cc: <linux-kernel@...>
Date: Thursday, September 13, 2007 - 5:15 pm

Is this the general opinion? - Should I try to get a replacement
motherboard of the same type?

If so, can anyone confirm that the sata_nv-driver is working with the
Gigabyte GA-N650SLI-DS4 motherboard at all / has anyone been successful
with this MB? How about the MCP51 SATA controller? - Can anyone confirm
that the driver is working for this HW? I would feel awkward trying to
claim a warranty replacement if it then turns out that the HW is OK after
all, and the problem is with the linux-driver...

BR
Jon Ivar

--
Jon Ivar Rykkelid Web: http://www.pvv.org/~jonry
Enromvegen 191 Phone: +47 72 56 86 86
N-7026 Trondheim Mob.: +47 906 20 250
Norway Email: jonry@pvv.org

-
