amd64 sata_nv (massive) memory corruption

From: Linas Vepstas
Date: Friday, August 1, 2008 - 10:30 am

Hi,

I'm seeing strong, easily reproducible (and silent) corruption on a
sata-attached disk drive on an amd64 board.  It might be the disk itself,
but I doubt it; googling suggests that it's somehow iommu-related, but I
cannot confirm this.

quickie summary:
-- disk is a brand new WDC WD5000AAKS-00YGA0 500GB disk (well, it
    was brand new a few months ago -- unused, at any rate)
-- passes smartmon with flying colors, including many repeated short and long
   self-tests. Been passing for months.  No hint of bad sectors or other errors
   in smartctl -a display
-- no ide, sata errors in syslog -- no block device errors, no fs errors, etc.
-- No oopses anywhere to be found
-- system works flawlessly with an old PATA disk. (although I'm running it
   with dma turned off with hdparm, out of paranoia)
-- system is amd64 dual core, ASUS M2N-E mobo, 4GB RAM
   Northbridge is nVidia Corporation MCP55 Memory Controller (rev a3)
-- I tried moving the sata cable around to other ports, no effect; also tried
   reseating it on hard drive, no effect.

Corruption is *easily* observed copying files with cp or dd. Also, typically
filesystem metadata is corrupted too. Creating even a small ext2 filesystem,
say 1GB, then copying 300MB of files onto it, unmounting it, and running fsck
will return many dozens of errors. Rerunning e2fsck over and over (as
e2fsck -f -y /dev/sda6) will report new errors about 1 out of every 3 times
(on small fs'es -- on big ones it will find new errors every time).
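
One way to quantify this kind of silent corruption is to write a file with
known contents and later check it block by block. The following is only a
minimal sketch, not the procedure used above: the function names, the 4 KiB
block size and the seed-based pattern are all assumptions made for
illustration. The verify step should run after an unmount/remount, otherwise
the data may come back from the page cache instead of the disk.

```python
import hashlib
import os

BLOCK = 4096  # compare in 4 KiB blocks (an arbitrary choice for this sketch)


def block_data(seed: int, index: int) -> bytes:
    """Deterministic contents for block `index`, derived from a hash."""
    digest = hashlib.sha256(f"{seed}:{index}".encode()).digest()  # 32 bytes
    return digest * (BLOCK // len(digest))


def write_pattern(path: str, blocks: int, seed: int = 0) -> None:
    """Write `blocks` blocks of the pattern and force them out to the device."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(block_data(seed, i))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path: str, blocks: int, seed: int = 0) -> list:
    """Return the indices of blocks that differ from what was written."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(BLOCK) != block_data(seed, i):
                bad.append(i)
    return bad
```

Running write_pattern() on the suspect filesystem, remounting it, and then
calling verify_pattern() with the same arguments gives a list of damaged
block indices; an empty list means the file came back intact.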

This behaviour has been observed with two different kernels:
with 2.6.23.9, compiled for 32-bit, and also 2.6.26 compiled
for 64-bit.

Googling this uncovers some Dec 2006 LKML emails suggesting an
iommu problem, which I explored:
-- My default boot complains
    Your BIOS doesn't leave a aperture memory hole
    Please enable the IOMMU option in the BIOS setup
    This costs you 64 MB of RAM
-- I cannot find any option in BIOS that even vaguely hints at IOMMU-like
    function; at best, I can assign interrupts to PCI slots, but that's it.
    There's a bunch of IO options for olde-fashioned superio-like stuff:
    serial, parallel ports, USB stuff, etc. but that's all.
-- booting with iommu=soft does get rid of the aperture memory hole
   message, but does not solve the corruption problem.
-- booting with iommu=force seems to have no effect.
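
When comparing boots like these, it helps to confirm which iommu= setting the
running kernel was actually started with. A minimal sketch, assuming the
command line is read from /proc/cmdline; the helper names are hypothetical
and the parsing only splits the comma-separated value, it does not validate
the options.

```python
def iommu_options(cmdline: str) -> list:
    """Extract the values of every iommu= parameter on a kernel command line."""
    opts = []
    for token in cmdline.split():
        if token.startswith("iommu="):
            # iommu= takes a comma-separated list, e.g. iommu=soft
            opts.extend(token[len("iommu="):].split(","))
    return opts


def current_iommu_options(path: str = "/proc/cmdline") -> list:
    """Read the running kernel's command line and return its iommu= options."""
    with open(path) as f:
        return iommu_options(f.read())
```

For example, iommu_options("root=/dev/sda1 ro iommu=soft") yields ["soft"],
and a command line without any iommu= parameter yields an empty list.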

I'm running the powernow-k8 cpu frequency regulator. On a hunch,
I wondered if this might be the source of the problem; however,
using the "performance" regulator to keep the clock speed nailed
at maximum had no effect on the corruption bug.
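
To double-check that the frequency policy really is pinned during a test run,
the active cpufreq policy can be read back from sysfs. This is a sketch
under the assumption that the kernel exposes the usual
/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor files; the function
name and the cpu_root parameter are made up for illustration.

```python
import glob
import os


def read_governors(cpu_root: str = "/sys/devices/system/cpu") -> dict:
    """Map each cpuN directory name to its current scaling_governor value."""
    result = {}
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]  # the "cpuN" path component
        with open(path) as f:
            result[cpu] = f.read().strip()
    return result
```

On a dual-core box with the clock nailed at maximum, one would expect every
entry in the returned dictionary to read "performance".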

Also of note:
-- problem was observed earlier, when system had 3GB RAM in it.
-- The integrated nvidia ethernet seems to work great, no errors, etc.
-- A different PCI ethernet card works great too.
-- I'm running graphics on an ancient matrox card in a PCI slot, and
    there's no hint of trouble there.
-- I'm using this system as my day-to-day desktop, and there seem to
   be no other problems. This suggests that if it's some pci iommu
   wackiness, it's certainly not affecting anything that isn't sata.

I really doubt the problem is the hard-drive; but I'll have to buy another
one to rule this out. It's possible that there's some problem with the
sata_nv driver, but there have been historical reports of corruption
on amd64 with other sata controllers. I can buy another sata controller
if needed, to experiment.

Other than that, any ideas for any further experiments? What can
I do to narrow the problem?

-- Linas Vepstas
