Re: raid 5 mismatch_cnt errors

Previous thread: raid 5 mismatch_cnt errors by Trey Scarborough on Thursday, May 20, 2010 - 9:58 am. (1 message)

Next thread: blkid and partition problem by Tim Bostrom on Friday, May 21, 2010 - 5:28 pm. (4 messages)
From: Trey Scarborough
Date: Thursday, May 20, 2010 - 10:02 am

I have a RAID 5 array with 9 disks and a mismatch_cnt that keeps
growing. This is causing file corruption on the underlying file systems
as well. I can copy a group of 100 files of 100 MB each, run md5sum on
them, and 1-3 will be corrupt. If a bad drive is the cause, is there
any way to get a report of how many of these mismatches occur per
drive? I have run SMART tests with smartmontools and do not see one
drive that stands out as causing errors. Could something else be
causing these errors?
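
For reference, the counter in question is exposed by the md driver through sysfs. The following is a minimal sketch, assuming the array is called md0 (a placeholder name) and that the script runs as root; the helper names and the sysfs_root parameter are made up for illustration. It reads the array-wide counter and requests a read-only check. Note that the kernel only provides one number for the whole array, not one per member device.

```python
from pathlib import Path


def md_sysfs_dir(array: str, sysfs_root: str = "/sys") -> Path:
    """Return the md control directory, e.g. /sys/block/md0/md."""
    return Path(sysfs_root) / "block" / array / "md"


def read_mismatch_cnt(array: str, sysfs_root: str = "/sys") -> int:
    """Read the array-wide mismatch counter (no per-drive breakdown exists)."""
    text = (md_sysfs_dir(array, sysfs_root) / "mismatch_cnt").read_text()
    return int(text.strip())


def request_check(array: str, sysfs_root: str = "/sys") -> None:
    """Ask md to start a read-only consistency check of the whole array."""
    (md_sysfs_dir(array, sysfs_root) / "sync_action").write_text("check\n")
```

A typical use would be request_check("md0"), waiting for the check to finish (progress is shown in /proc/mdstat), and then calling read_mismatch_cnt("md0").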
--

From: Neil Brown
Date: Thursday, May 20, 2010 - 2:16 pm

On Thu, 20 May 2010 12:02:23 -0500


When RAID5 detects an inconsistency there is no way to know which device was
wrong.
SMART only detects some errors, not all.
I have had hard drives before which appeared to have a single-bit error in
their internal buffer.  No error would be reported, but the data you read
would sometimes be wrong.
RAID5 cannot help you with this sort of error.

I would suggest backing up all your data (if it isn't already too late),
breaking the array, and testing each device individually.
e.g. create a filesystem on the device and try copying data on and reading it
off.
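
A per-device copy test of this kind could be sketched as follows. This is only an illustration: the function names are invented here, and SHA-256 is used where md5sum would do equally well. It copies a file onto the filesystem under test and compares checksums of source and copy.

```python
import hashlib
import shutil


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src: str, dst: str) -> bool:
    """Copy src to dst (on the device under test) and compare checksums."""
    shutil.copyfile(src, dst)
    return sha256_of(src) == sha256_of(dst)
```

One caveat: a read that follows immediately after the copy may be served from the page cache rather than from the disk, so the test is more meaningful with a data set larger than RAM or after dropping caches (echo 3 > /proc/sys/vm/drop_caches as root).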

NeilBrown
--

From: Trey Scarborough
Date: Thursday, May 20, 2010 - 3:29 pm

That's what I was afraid of. The problem I have, if I back it up, is
knowing which data is bad. Luckily it appears to be a write error, because
once a file has been written correctly I can run sums on all the files
and I do not see any more errors. I was thinking there might be a way of
doing a resync with debugging turned up so that it would log each
mismatch along with the drives it was reading from at the time. I could
then take that information and, since there are 9 drives in the array,
the one with the most mismatches should be the culprit. I could
then remove that drive from the array and test it, leaving the rest in a
state that could be rebuilt, with the data consistent because the
drive causing the bad writes would be removed. Is this something that
might be possible?

Thanks,
Trey

--

From: Neil Brown
Date: Thursday, May 20, 2010 - 3:38 pm

On Thu, 20 May 2010 17:29:37 -0500

To detect a mismatch, raid5 reads from all drives in parallel, calculates the
parity across the data blocks and compares that to the parity block.
So no: something like that is not possible.
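
Why a mismatch cannot be pinned on one device can be shown with a toy stripe. The sketch below is a simplification under stated assumptions (8 tiny data blocks plus one parity block, no parity rotation, helper names invented here), not the md implementation. It flips the same bit in each of the 9 blocks in turn and shows that the parity check fails in exactly the same way every time.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, as RAID5 parity does."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """The check md can perform: does parity match the XOR of the data?"""
    return xor_blocks(data_blocks) == parity_block


def flip_bit(block, index=0, mask=0x01):
    """Return a copy of block with one bit flipped (a silent corruption)."""
    out = bytearray(block)
    out[index] ^= mask
    return bytes(out)


# A toy stripe for a 9-disk array: 8 data blocks and 1 parity block.
data = [bytes([i] * 8) for i in range(1, 9)]
parity = xor_blocks(data)

# Corrupt each of the 9 blocks in turn and record what the check "sees":
# the XOR of all blocks, which is all zeros for a clean stripe.
syndromes = set()
for victim in range(9):
    stripe = data + [parity]
    stripe[victim] = flip_bit(stripe[victim])
    syndromes.add(xor_blocks(stripe))

# Only one distinct result: the check cannot tell which block was wrong.
print(len(syndromes))
```

With a single parity block the check yields only "consistent" or "inconsistent" for a given position in the stripe; the identity of the faulty device is not recoverable from it.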

The only thing I can suggest:

- add a write-intent bitmap so you can remove/re-add devices fairly cheaply
- create a very large file.
- write random data to the file without truncating it (use dd of=file
  conv=notrunc), then read it back and see if it matches.   If it does, then
  this approach doesn't help.  If it doesn't:

  One by one, fail/remove a drive from the array.  Write new random data to
  the same file and read it back and compare.  Then --re-add the missing
  device.
  I'm hoping that you will get an error every time except when the 'bad'
  device has been removed.
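
The steps above could be sketched roughly as follows. Everything named here is an assumption for illustration: the function names, the array path and the member device names are placeholders, and the mdadm command lines are only built as lists, not executed.

```python
import hashlib
import os


def overwrite_and_verify(path, chunk_size=1 << 20):
    """Overwrite an existing file in place with random data and read it back."""
    remaining = os.path.getsize(path)
    written = hashlib.sha256()
    # "r+b" keeps the existing file and its size, like dd conv=notrunc.
    with open(path, "r+b") as f:
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            f.write(chunk)
            written.update(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # push the data out to the array
    # Note: this read may still be served from the page cache.
    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            read_back.update(chunk)
    return written.digest() == read_back.digest()


def bitmap_command(array):
    """Command line that adds an internal write-intent bitmap to the array."""
    return ["mdadm", "--grow", array, "--bitmap=internal"]


def rotation_plan(array, members):
    """Yield (device, commands_before_test, commands_after_test) per member."""
    for dev in members:
        before = [["mdadm", array, "--fail", dev],
                  ["mdadm", array, "--remove", dev]]
        after = [["mdadm", array, "--re-add", dev]]
        yield dev, before, after
```

For each entry in the plan, the "before" commands would be run, then overwrite_and_verify() on the large file, then the "after" command. The array must be allowed to finish recovering (watch /proc/mdstat) before the next device is failed, since a RAID5 array cannot survive two missing members.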

NeilBrown
--

From: Doug Ledford
Date: Thursday, May 20, 2010 - 7:16 pm

While a bad drive is certainly a possibility here, this is precisely the
type of failure scenario that would make me suspect bad RAM,


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband

From: MRK
Date: Friday, May 21, 2010 - 9:40 am

Could the cabling to the drive be causing this? (maybe failing or maybe 
it's partly disconnected)
I don't remember how far along Linux is in implementing the checksums 
between the controller and the drive.
--

From: Doug Ledford
Date: Friday, May 21, 2010 - 1:57 pm

I don't know.  I'm not up on the SATA signaling details so I don't know
if it uses CRC on the signal, but I suspect it does and a bad cable
would cause failed requests.  But I wouldn't bet my house on it, so I
would ask some SATA gurus.



From: Tim Small
Date: Monday, May 24, 2010 - 2:34 am

I wouldn't call myself that, but I believe PATA and SATA-level CRC 
errors show up in the UDMA_CRC_Error_Count SMART variable - look for a 
non-zero raw value in the smartctl output.  This is presumably just the 
error count from the drive's point of view (bad data received at the 
drive end).  I don't know what happens with CRC errors detected at the 
Linux end, or whether detection is controller-dependent.  Better ask on 
linux-ide.
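
Checking that attribute across 9 drives is easy to script. The following is a rough sketch, assuming the usual ten-column attribute table printed by smartctl -A, with the raw value in the last column; the function name is invented for illustration.

```python
def udma_crc_raw_value(smartctl_output):
    """Return the raw UDMA_CRC_Error_Count, or None if it is not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows: ID, name, flag, value, worst, thresh, type,
        # updated, when_failed, raw value.
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            return int(fields[9])
    return None
```

The text passed in would be the output of smartctl -A for one drive; a non-zero result on a particular drive points at that drive's cable or port rather than at the platters.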


From the SMART attribute name, presumably the earlier PATA transfer 
modes don't support CRC error detection.

An easy thing to check might be to reduce the libata transfer speed from 
3 Gbps to 1.5 Gbps.  Similarly, try to test each drive and SATA port in 
isolation if you can.

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309

--

From: Robert Hancock
Date: Tuesday, May 25, 2010 - 12:09 pm

ATA transfer errors should cause a bad CRC, resulting in a failed 
transfer that will cause complaints in the kernel log. For PATA, only 
UDMA modes can detect CRC errors; PIO and MWDMA transfers can't.

There are other places where data corruption can occur, however, such as 
inside the controller or the drive itself.
--

From: Bill Davidsen
Date: Wednesday, May 26, 2010 - 8:07 am

I have the same thought. I would remove half the RAM from the system and 
test again, then swap to the "other" half and repeat. Of course running 
memtest first is a good idea, but I have seen failures which only happen 
on disk access.

If the system is overclocked, obviously the first step is to cut the 
speed back.

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein

--

From: Doug Ledford
Date: Wednesday, May 26, 2010 - 8:49 am

Indeed, I've seen lots of failures that only happen with disk access and
not with memory testers.  Hence why I have a shell script on my web page


