I have a RAID 5 array with 9 disks and a mismatch_cnt that keeps growing. This is causing file corruption on the underlying file systems as well. I can copy a group of 100 100 MB files, run md5sum on them, and 1-3 will be corrupt. If one drive is bad, is there any way to get a per-drive report of where these mismatches occur? I have run smartmontools tests and no single drive stands out as causing errors. Could something else be causing these errors? --
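For anyone following along, here is a minimal sketch of how the growing mismatch_cnt could be tracked over time. It assumes the array is md0 and that the counter is exposed at /sys/block/md0/md/mismatch_cnt; the function names and the sysfs_root parameter are made up for illustration, and the counter is only refreshed by a check or repair pass.

```python
from pathlib import Path


def read_mismatch_cnt(md_name="md0", sysfs_root="/sys"):
    """Return the current mismatch_cnt of an md array as an integer.

    md_name and sysfs_root are illustrative parameters; adjust them
    to match the array being examined.
    """
    path = Path(sysfs_root) / "block" / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())


def mismatch_growth(samples):
    """Given counter readings taken after successive checks,
    return how much the counter changed between each pair."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]
```

Calling read_mismatch_cnt("md0") after each check pass and feeding the readings to mismatch_growth() shows whether new mismatches are still being created, but it says nothing about which drive is responsible.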
On Thu, 20 May 2010 12:02:23 -0500 When RAID5 detects an inconsistency there is no way to know which device was wrong. SMART only detects some errors, not all. I have had hard drives before which appeared to have a single-bit error in their internal buffer. No error would be reported, but the data you read would sometimes be wrong. RAID5 cannot help you with this sort of error. I would suggest backing up all your data (if it isn't already too late), breaking the array, and testing each device individually, e.g. create a filesystem on the device and try copying data on and reading it off. NeilBrown --
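The per-device test suggested above could be sketched roughly as follows. This is only an illustration: the function name, the default sizes, and the choice of SHA-256 (rather than md5sum) are assumptions, and the path should point at a file on the filesystem created on the single device under test.

```python
import hashlib
import os


def write_and_verify(path, size_mb=100, chunk_mb=1):
    """Write random data to `path`, read it back, and report whether
    the hash of what was read matches the hash of what was written."""
    chunk = chunk_mb * 1024 * 1024
    written = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(size_mb // chunk_mb):
            block = os.urandom(chunk)
            written.update(block)
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push the data out of the page cache to the device

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            read_back.update(block)
    return written.hexdigest() == read_back.hexdigest()
```

One caveat: the read may be served from the page cache rather than the disk. Using a file larger than RAM, or dropping caches between the write and the read, makes it more likely the data really comes back from the device.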
That's what I was afraid of. The problem is that if I back it up, I have no way of knowing which data is bad. Luckily it appears to be a write error, because once a file is written correctly I can run sums on all the files and I do not see any more errors. I was thinking that there might be a way of doing a resync with the debug level turned up so that it would log each mismatch along with the drives it was reading from at the time. I could then take that information and, considering there are 9 drives in the array, the one that shows up most often should be the culprit. I could then remove that drive from the array and test it, leaving the rest in a state that could be rebuilt, with the data consistent because the drive with the bad writes would be gone. Is this something that might be possible? Thanks, Trey --
On Thu, 20 May 2010 17:29:37 -0500
To detect a mismatch, raid5 reads from all drives in parallel, calculates the parity across the data blocks and compares that to the parity block. So no: something like that is not possible.

The only thing I can suggest:
 - add a write-intent bitmap so you can remove/re-add devices fairly cheaply
 - create a very large file
 - write random data to the file without truncating it (use dd of=file conv=notrunc), then read it back and see if it matches

If it does match, then this approach doesn't help. If it doesn't: one by one, fail/remove a drive from the array, write new random data to the same file, read it back and compare, then --re-add the missing device. I'm hoping that you will get an error every time except when the 'bad' device has been removed.
NeilBrown --
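To illustrate why the parity comparison cannot name the faulty device, here is a toy model of one stripe. It is a sketch only: the block contents, the helper names, and the 4-byte block size are invented, and real md code works on much larger chunks. The point is that a flipped bit produces the same XOR residue no matter which block it landed in.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """RAID5-style check: parity computed from the data must equal stored parity."""
    return xor_blocks(data_blocks) == parity_block


def syndrome(data_blocks, parity_block):
    """XOR of every block in the stripe; all zero bytes when consistent."""
    return xor_blocks(list(data_blocks) + [parity_block])


def flip_bit(block, index, mask):
    """Return a copy of `block` with `mask` XORed into byte `index`."""
    damaged = bytearray(block)
    damaged[index] ^= mask
    return bytes(damaged)


# A toy stripe: three data blocks plus their parity block.
data = [b"\x01\x02\x03\x04", b"\x05\x06\x07\x08", b"\x09\x0a\x0b\x0c"]
parity = xor_blocks(data)

# The same single-bit error placed on three different "drives".
bad_first = [flip_bit(data[0], 2, 0x10), data[1], data[2]]
bad_last = [data[0], data[1], flip_bit(data[2], 2, 0x10)]
s_first = syndrome(bad_first, parity)
s_last = syndrome(bad_last, parity)
s_parity = syndrome(data, flip_bit(parity, 2, 0x10))

# All three residues are identical, so the check alone cannot
# tell which device returned the wrong data.
print(s_first == s_last == s_parity)
```

This is why the remove-one-drive-at-a-time procedure is needed: only taking a device out of the stripe changes which blocks take part in the write and read-back.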
While a bad drive is certainly a possibility here, this is precisely the
type of failure scenario that would make me suspect bad RAM,
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
Could the cabling to the drive be causing this (maybe it's failing, or partly disconnected)? I don't remember how far along Linux is in implementing checksums between the controller and the drive. --
I don't know. I'm not up on the SATA signaling details so I don't know
if it uses CRC on the signal, but I suspect it does and a bad cable
would cause failed requests. But I wouldn't bet my house on it, so I
would ask some SATA gurus.
--
Doug Ledford <dledford@redhat.com>
I wouldn't call myself that, but I believe PATA and SATA-level CRC errors show up in the UDMA_CRC_Error_Count SMART attribute: look for a non-zero raw value in the smartctl output. This is presumably just the error count from the drive's point of view (bad data received at the drive end). I don't know what happens with CRC errors detected at the Linux end, or whether detection is controller-dependent. Better ask on linux-ide. From the SMART attribute name, presumably the earlier PATA transfer modes don't support CRC error detection. An easy thing to check might be to reduce the libata transfer speed from 3 Gbps to 1.5 Gbps. Similarly, try to test each drive and SATA port in isolation if you can.... Tim. -- South East Open Source Solutions Limited Registered in England and Wales with company number 06134732. Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ VAT number: 900 6633 53 http://seoss.co.uk/ +44-(0)1273-808309 --
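Checking that attribute across nine drives by eye gets tedious, so here is a rough sketch of pulling the raw value out of saved `smartctl -A` output. It assumes the usual ten-column attribute table with the raw value in the last column; the function name is made up, and the sample line is fabricated for illustration, not taken from a real drive.

```python
def crc_error_count(smartctl_output):
    """Return the raw UDMA_CRC_Error_Count from `smartctl -A` text,
    or None if the attribute is not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Assumed columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            return int(fields[9])
    return None


# Fabricated example line in the assumed layout.
sample = (
    "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n"
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       37\n"
)
print(crc_error_count(sample))
```

A non-zero count that keeps rising on one drive would point at that drive's cable or port; a count of zero on every drive does not rule out corruption elsewhere in the path.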
ATA transfer errors should cause a bad CRC, resulting in a failed transfer, which will cause complaints in the kernel log. For PATA, only UDMA modes can detect CRC errors; PIO and MWDMA transfers can't. There are other places where data corruption can occur, however, such as inside the controller or the drive itself. --
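As a toy illustration of that distinction, the sketch below uses zlib.crc32 purely as a stand-in for the link-level CRC, which in reality is computed by the hardware. The function name and payload are invented. Damage that happens on the wire is caught; damage that happens before the checksum is computed passes the check.

```python
import zlib


def transfer_ok(payload, crc_sent):
    """Receiver-side check: recompute the CRC and compare with the one sent."""
    return zlib.crc32(payload) == crc_sent


payload = bytes(range(256)) * 2

# Case 1: a bit flips in transit, after the sender computed the CRC.
crc = zlib.crc32(payload)
damaged_in_transit = bytearray(payload)
damaged_in_transit[100] ^= 0x01
caught = not transfer_ok(bytes(damaged_in_transit), crc)

# Case 2: a bit flips in a buffer before the CRC is computed,
# so the checksum faithfully covers the already-wrong data.
damaged_in_buffer = bytearray(payload)
damaged_in_buffer[100] ^= 0x01
crc_of_bad_data = zlib.crc32(bytes(damaged_in_buffer))
slipped_through = transfer_ok(bytes(damaged_in_buffer), crc_of_bad_data)

print(caught, slipped_through)
```

The second case matches the kind of silent error described earlier in the thread: nothing is logged, yet the data that lands on disk is wrong.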
I have the same thought: I would remove half the RAM from the system and test again, then swap to the "other" half and repeat. Of course running memtest first is a good idea, but I have seen failures which only happen on disk access. If the system is overclocked, obviously the first step is to cut the speed back... -- Bill Davidsen <davidsen@tmr.com> "We can't solve today's problems by using the same thinking we used in creating them." - Einstein --
Indeed, I've seen lots of failures that only happen with disk access and
not with memory testers. Hence why I have a shell script on my web page
--
Doug Ledford <dledford@redhat.com>
