Hi, I'm resending (didn't see my first attempt appear on the maillist):
I'm having serious disk-issues when using the on-board nvidia controller
for my HDDs (My motherboard is a Gigabyte GA-N650SLI-DS4 with nvidia
chipset, cpu is Intel Core2Quad)

Excerpt from "lspci":
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev a1)
00:0e.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:0f.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)

I have a normal IDE/P-ATA-disk attached to the "IDE"-controller and that
works fine (/dev/hda)

However, any number of disks (I have tried 2 and 4) connected to the
SATA-controller(s), will eventually fail. - See attached log (excerpt /
anything relevant from /var/log/messages)

At first, disks were REALLY unstable, but then I disabled S.M.A.R.T.
(both in BIOS and Linux), and I updated from the CentOS5 (equivalent of
RHEL5) kernel (2.6.18) to the latest (at that time) official kernel from
kernel.org:

> uname -a
Linux mirakel 2.6.22.5-custom_jir #2 SMP Thu Aug 30 22:06:21 CEST 2007 i686 i686 i386 GNU/Linux

Now it will normally take a day or two before SATA crashes, so things
are better, but still rather useless.

First error when sata_nv gets into problems is always:
"exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen"
(as shown in the attached log-file.) - when this happens to one device,
it'll almost instantly happen to the other disk attached to that
controller as well. A couple of minutes (or so) later, the disk(s)
connected to the other controller will start acting up as well (in the
same manner). - I/O freezes, and nothing helps except a reboot...

As I run a rather large (software / md) RAID-5 disk array on this server
(I'm doing a bit of video editing), every crash means a time-consuming
rebuild of the disk-array...

I have given up on the sata_nv / nvidia-controllers for the time being.
I now resort to some old PCI-connected sata-controllers which work fine
(but...
Does the adma=0 module option do anything?
Jeff
-
Thanks for the suggestion, but sata_nv is not built modular in my
current kernel, so "no can do" at the moment
(However, if some expert REALLY thinks this will fix things, I will
CERTAINLY recompile and give it a go)

As I said before, it all works for some time (a day or two) before it
crashes with the current kernel & no "S.M.A.R.T.". With my current setup
I have always had the time to fully rebuild my disk-array before a new
crash. - In the case of 4 disks attached to the nvidia controllers
(disregarding the disks on other controllers), this means that the
sata_nv-driver / controllers alone have read at least 750GB and written
250GB of data before the crash (with no resets working) - soft reboot
fixes everything. - I'm pretty confident that this is a driver issue.

As Tejun Heo <htejun@gmail.com> writes "the whole controller seems to
have went down at once and it's not even IRQ routing problem - resets
are failing."

The error-messages / crash-symptoms were the same with SMART enabled and
the original CentOS5-kernel, except that with that setup, the crashes
were much more frequent.

Any help?
BR
Jon Ivar
-
Passing "sata_nv.adma=0" as kernel boot parameter will do the trick.
--
tejun
-
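A minimal sketch for confirming that a boot parameter such as
"sata_nv.adma=0" actually reached the running kernel, assuming a Linux
system that exposes its command line in /proc/cmdline. The function name
has_boot_param and its optional second argument are hypothetical, added
only so the check can be tried against a sample file. Note that the
"module.option=value" form applies when the driver is built into the
kernel; for a loadable module the option normally goes in the modprobe
configuration instead.

```shell
#!/bin/sh
# has_boot_param PARAM [CMDLINE_FILE]
# Succeeds (exit status 0) if PARAM appears as a whole word on the kernel
# command line. CMDLINE_FILE defaults to /proc/cmdline; the override is a
# hypothetical convenience for testing against a sample file.
has_boot_param() {
    param=$1
    file=${2:-/proc/cmdline}
    # Put each whitespace-separated word on its own line, then look for an
    # exact, fixed-string, whole-line match.
    tr -s ' \t' '\n' < "$file" | grep -Fxq -- "$param"
}
```

Usage: "has_boot_param sata_nv.adma=0 && echo present" prints "present"
only if the running kernel was booted with exactly that word.
-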
Resending, as my first attempts contained HTML and were blocked...
Ahh, silly me... Of course!
Ooops, I just got back, and verified: I actually have sata_nv running as
a module after all on this server... My bad.
I fixed /etc/modprobe.conf to include the following two lines:
"
alias scsi_hostadapter sata_nv
options sata_nv adma=0
...
"

I then ran "mkinitrd" (to ensure that the latest options from
modprobe.conf were included) in the initrd-image that I load at boot.

- Do you guys think this is worth a try? Anyway, I have rebooted now, so
I'll test it for a few days and let you know - We'll just have to wait
and see...
Do you think I should re-enable SMART to provoke a failure, or would
that be to tempt fate too much? (For now I have not re-enabled SMART)

PS: Is there any way of testing / verifying that sata_nv is now running
with this option? - I am pretty sure I have done it correctly, but I
would still like to confirm that the proper option has been passed if
possible.

Thanks
Jon Ivar
-
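On the question of verifying that the option was picked up: one possible
check, sketched under the assumption that the driver exports its "adma"
parameter readably through sysfs (this depends on how the driver declares
the parameter and on the kernel version, so the file may be absent). The
helper name show_module_param and its optional sysfs-root argument are
hypothetical, introduced for illustration only.

```shell
#!/bin/sh
# show_module_param MODULE PARAM [SYSFS_ROOT]
# Prints the current value of a module parameter as exported by the kernel
# under SYSFS_ROOT/module/MODULE/parameters/PARAM. SYSFS_ROOT defaults to
# /sys; the override is a hypothetical convenience for dry runs.
show_module_param() {
    module=$1
    param=$2
    root=${3:-/sys}
    path="$root/module/$module/parameters/$param"
    if [ -r "$path" ]; then
        # Boolean parameters are typically reported as Y or N.
        printf '%s.%s = %s\n' "$module" "$param" "$(cat "$path")"
    else
        printf '%s.%s is not exported under %s\n' "$module" "$param" "$root" >&2
        return 1
    fi
}
```

Usage: "show_module_param sata_nv adma" prints the value the loaded driver
is actually using, if the file exists. "lsmod | grep sata_nv" separately
confirms that the driver is loaded as a module at all.
-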
I don't think it will matter, as adma doesn't affect MCP51, but only nforce4.
So I'd look for other trouble makers.
--
(°= =°)
//\ Prakash Punnoor /\\
V_/ \_V
Robert told me. (And you're correct - It didn't help).
I'm going to test another (identical) motherboard this evening to
establish whether it could be a HW-issue.

I'll keep you posted
Jon Ivar
-
Hi, I'm getting more confident that the driver is the issue.
I have now been able to reproduce the same error on the new motherboard
as well... - (the same MB was tested to work in Windows with
windows-drivers)...

Unless you guys can come up with something clever, I'll see if I can get
my hands on / change to another (non-nvidia) chipset in a day or two, as
the sata_nv with this chipset apparently isn't working.

(Or has anyone EVER been successful with the latest kernel/driver on
this HW)?

Attaching everything relevant from /var/log/messages...
Jon Ivar
I don't have exactly the same hw, but the same chipset and I don't have any
problems - even with the swncq patch applied. Do you have an hpet? If not,
try booting with acpi_use_timer_override. My system won't work with skipping
the override.
--
(°= =°)
//\ Prakash Punnoor /\\
V_/ \_V
Hi, I reconnected and rebooted with the kernel option
"acpi_use_timer_override" (this is the correct spelling, isn't it? -
Kernel didn't complain.). Didn't help, the same error received as
before. - I'll have to connect all disks back to my PCI-connected SATA
controllers and start rebuilding my RAID yet again.

It seems random which disk is first affected (This far, I know that it
has happened to ata1, ata3 and ata4, three of my potential disks) - I
guess it just happens to the disk that is being used at the moment when
the driver / controller acts up.

I'm about to give in. I think I'll try to replace both (Gigabyte
GA-N650SLI-DS4) motherboards, as the driver simply isn't working for
the on-board controller of these boards. Could be a combination of the
controllers and some other HW on the motherboards of course, but all is
working when I connect all disks to my non-nvidia controllers. - Guess
I'll opt for a motherboard with an intel-chipset after all...

BR
Jon Ivar
-
>>>>> "Jon" == Jon Ivar Rykkelid <jonry@pvv.org> writes:
Jon> Hi, I reconnected and rebooted with the kernel option
Jon> "acpi_use_timer_override" (this is the correct spelling, isn't
Jon> it? - Kernel didn't complain.). Didn't help, the same error
Jon> received as before. - I'll have to connect all disks back to my
Jon> PCI-connected SATA controllers and start rebuilding my RAID yet
Jon> again.

What happens when you just have ONE disk connected to the motherboard
controller, and the rest connected to PCI controllers? Does it crap
out then? You've just such a nice repeatable problem across
motherboards that it's a shame to waste this debugging time.

I'm wondering if it's a PCI bus issue somehow, and that the load on
the motherboard controller isn't supportable when you have a bunch of
disks on PCI controllers as well. Shot in the dark...

Thanks for all your hard work on this, I know how frustrating it is to
not have a stable system!

John
-
Sorry, I gave in. I have now abandoned my nvidia trials (both
motherboards have been returned, and I'm now running with Intel chipset)
- My current motherboard is less ideal (in terms of PCI-slots etc.), but
That was actually not such a bad idea... Unfortunately it's too late now
(If not I should have tested for sure). I was/am after all running an
8-disk SATA array (plus a normal IDE disk - not in the raid). I had 4
disks running through two PCI-cards and 4 disks used the motherboard's
controller. - When all 8 disks were connected to the two PCI-cards the
speed dropped compared to when the motherboard's controller took some
load.. (So it could maybe be an issue with bandwidth / load ? - I don't
Sorry for giving in, but I felt I was banging my head against the wall
(and with too few sensible solutions being suggested). Now I guess I'm
semi-happy that all seems to work OK with the Intel chipset..
Frustrating that the sata_nv-driver / nvidia HW didn't work with my
configuration, though...

Thank you all for your effort as well - hope someone figures this out
sometime in the future.

All the best
Jon Ivar
-
Not just motherboard. It is more likely to be a cable, drive or PSU
problem.

Jeff
-
I don't think it's cable as the problem occurs on multiple ports. My
bet is either the controller or PSU.

Thanks.
--
tejun
-
Hi,
I now tested with the adma=0 option, but if anything I got a crash
quicker than before. Same error message started coming in, but this time
the system hung before I was able to capture the log as well (but I saw
the error, and it was the same as before, except that this time it was
the ata3-channel that first started acting up..) - To remind you all
what this is about, I have reattached the log that I originally captured...

Any help / clever suggestions are appreciated.
Jon Ivar
Sounds like a hardware problem, since disabling ADMA is generally the
cure-all we use -- it appears to stress the hardware less.

Jeff
-
If this is an MCP51 chipset, adma=0 will make no difference since that
chipset does not support ADMA in the first place.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/
-
Hi,
To eliminate the possibility of this being a hardware issue, I have now
acquired another "Gigabyte GA-N650SLI-DS4" motherboard (with the "MCP51"
chipset) for testing. I'll swap parts this evening. Hopefully I'll be
able to tell you in a few hours whether this appears to be working as it
should. The motherboard that I'm going to swap to has actually been
tested (with MS Windows OS+driver) for more than a day with a disk
connected, so if this MB also fails, I think it will be safe to say that
the issue is with the sata_nv driver... So hang on.

(You can't think of something else that could conflict with the sata_nv
driver after a bit of time, like two of my raid-disks being encrypted,
me running a SW raid-5 array / some special HW (quad-core CPU) / me
running vmware on this server ... ? - To me, all these suggestions seem
rather far fetched, especially as all is working with another
controller, so I'm arguing that unless there's a HW issue, the issue is
with the driver, but you're the expert(s), so let me know if you differ.)

I'll keep you posted as to the result of swapping HW.. Give me a few
hours. :-)

BR
Jon Ivar
-
Is this the general opinion? - Should I try to get a replacement
motherboard of the same type?

If so, can anyone confirm that the sata_nv-driver is working with the
Gigabyte GA-N650SLI-DS4 motherboard at all / has anyone been successful
with this MB? How about the MCP51 SATA controller? - Can anyone confirm
that the driver is working for this HW? I would feel awkward to try to
claim a warranty replacement if it is proved that the HW is OK after
all, and the problem is with the linux-driver...

BR
Jon Ivar

--
Jon Ivar Rykkelid     Web:   http://www.pvv.org/~jonry
Enromvegen 191        Phone: +47 72 56 86 86
N-7026 Trondheim      Mob.:  +47 906 20 250
Norway                Email: jonry@pvv.org
-
