Re: read errors corrected

Previous thread: Considering a complete rework of RAID on my home compute server by Mark Knecht on Wednesday, December 29, 2010 - 12:54 pm. (1 message)

Next thread: New raid level suggestion. by Rogier Wolff on Thursday, December 30, 2010 - 1:23 am. (14 messages)
From: James
Date: Wednesday, December 29, 2010 - 8:20 pm

All,

I'm looking for a bit of guidance here. I have a RAID 6 set up on my
system and am seeing some errors in my logs as follows:

# cat messages | grep "read erro"
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262528 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262536 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262544 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262552 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262560 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262568 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262576 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262584 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262592 on sda4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923648 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923656 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923664 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923672 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923680 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923688 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923696 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923520 on sdc4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923528 on sdc4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923536 ...
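
With this many messages it can help to see how they are spread across the member devices. A minimal sketch, assuming the usual one-line syslog format in which the device name is the last field; the function name count_corrected_reads is made up for this example:

```shell
# Count md "read error corrected" messages per member device.
# Reads syslog lines on stdin and prints "<count> <device>", one per line.
count_corrected_reads() {
    awk '/read error corrected/ {
             dev = $NF              # last field, e.g. "sda4)"
             sub(/\)$/, "", dev)    # drop the closing parenthesis
             n[dev]++
         }
         END { for (d in n) print n[d], d }' | sort -k2
}
```

Usage would be something like: count_corrected_reads < /var/log/messages (the log path is an assumption and varies by distribution).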
From: Mikael Abrahamsson
Date: Wednesday, December 29, 2010 - 10:24 pm

You should look into the SMART information on the drives using smartctl.
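
As a rough sketch of what to look for: the attributes below are the ones commonly associated with failing media (reallocated and pending sectors) or with link and cabling trouble (CRC errors). The attribute names are the ones smartctl usually prints, but vendors differ, so treat the list as an assumption; smart_error_attrs is a made-up helper name.

```shell
# Filter `smartctl -A` output (on stdin) down to the attributes that most
# often accompany read errors, printing "<attribute> <raw value>".
smart_error_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ {
             print $2, $10
         }'
}
```

For example: for i in a b c d ; do smartctl -A /dev/sd$i | smart_error_attrs ; done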

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se
--

From: James
Date: Thursday, December 30, 2010 - 9:33 am

Lots of tremendous responses. I appreciate it. I'm going to reply to
the first person who responded here, but this email should cover some
of the questions posed in further responses.


Here are some other logs that may be relevant:

Dec 15 15:40:34 nuova kernel: sd 0:0:0:0: [sda] Unhandled error code
Dec 15 15:40:34 nuova kernel: sd 0:0:0:0: [sda] Result: hostbyte=0x00 driverbyte=0x06
Dec 15 15:40:34 nuova kernel: sd 0:0:0:0: [sda] CDB: cdb[0]=0x28: 28 00 3b e3 53 ea 00 00 48 00
Dec 15 15:40:34 nuova kernel: end_request: I/O error, dev sda, sector 1004753898
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262528 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262536 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262544 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262552 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262560 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262568 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262576 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262584 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262592 on sda4)

Unfortunately I had not caught those error messages at first
glance...I/O error? Hrmm...doesn't sound good. The issue is repeated
later on.

Dec 29 03:04:01 nuova kernel: sd 1:0:1:0: [sdd] Unhandled error code
Dec 29 03:04:01 nuova kernel: sd 0:0:1:0: [sdb] Unhandled error code
Dec 29 03:04:01 nuova kernel: sd 0:0:1:0: [sdb] Result: hostbyte=0x00 driverbyte=0x06
Dec 29 03:04:01 nuova kernel: sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 1b 06 d2 ea 00 00 78 00
Dec 29 03:04:01 nuova kernel: end_request: I/O error, dev sdb, sector 453432042
Dec 29 03:04:01 ...
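
The CDB bytes in these messages can be decoded by hand. Opcode 0x28 is a SCSI READ(10): bytes 2-5 hold the starting LBA (big-endian) and bytes 7-8 the transfer length in sectors. A small sketch, with decode_cdb10 being a made-up helper name:

```shell
# Decode a 10-byte READ(10)/WRITE(10) CDB given as ten hex bytes.
# Example: decode_cdb10 28 00 3b e3 53 ea 00 00 48 00
decode_cdb10() {
    [ "$#" -eq 10 ] || { echo "expected 10 hex bytes" >&2; return 1; }
    # CDB bytes 2..5 are arguments $3..$6: starting LBA, big-endian
    lba=$(( (0x$3 << 24) | (0x$4 << 16) | (0x$5 << 8) | 0x$6 ))
    # CDB bytes 7..8 are arguments $8..$9: transfer length in sectors
    len=$(( (0x$8 << 8) | 0x$9 ))
    echo "opcode=0x$1 lba=$lba sectors=$len"
}
```

For the sda CDB this gives LBA 1004753898, the same sector reported in the end_request line, and a length of 72 sectors. That would be consistent with the nine 8-sector corrections md logged for sda4 in the same second, which suggests one failed request rather than nine separate bad spots, though the thread does not confirm that reading.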
From: Roman Mamedov
Date: Thursday, December 30, 2010 - 9:44 am

On Thu, 30 Dec 2010 11:33:31 -0500

If your drives run at 68 degrees Celsius, you should emergency-cut the power
ASAP and perhaps reach for the nearest fire extinguisher.

-- 
With respect,
Roman
From: James
Date: Thursday, December 30, 2010 - 9:51 am

Agreed. ;) That's why I posted those messages -- I'm unsure why it
would change those values.

Here's what smartctl shows for all of the drives:

~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep Temperature ; done
190 Airflow_Temperature_Cel 0x0022   069   059   045    Old_age   Always       -       31 (Lifetime Min/Max 23/37)
194 Temperature_Celsius     0x0022   031   041   000    Old_age   Always       -       31 (0 23 0 0)
190 Airflow_Temperature_Cel 0x0022   068   058   045    Old_age   Always       -       32 (Lifetime Min/Max 22/38)
194 Temperature_Celsius     0x0022   032   042   000    Old_age   Always       -       32 (0 22 0 0)
190 Airflow_Temperature_Cel 0x0022   068   057   045    Old_age   Always       -       32 (Lifetime Min/Max 22/38)
194 Temperature_Celsius     0x0022   032   043   000    Old_age   Always       -       32 (0 22 0 0)
190 Airflow_Temperature_Cel 0x0022   069   059   045    Old_age   Always       -       31 (Lifetime Min/Max 23/37)
194 Temperature_Celsius     0x0022   031   041   000    Old_age   Always       -       31 (0 23 0 0)

Those values seem appropriate, particularly since the "max" is 37 (as
--

From: Ryan Wagoner
Date: Thursday, December 30, 2010 - 10:59 am

Not sure why the log is showing the weird C temp. The output from
smartctl looks correct. The max is not a limit defined by the
manufacturer, but the maximum temp the drive has reached.
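
One possible explanation, judging only from the numbers posted: for attribute 190 the normalized VALUE column (069, 068) equals 100 minus the raw temperature (31, 32) on every drive. If whatever wrote the log reports the normalized value rather than the raw one, a "68" would really mean 32 C. This is a guess from the output shown, not something confirmed in the thread. A sketch to check the relationship (airflow_temp_check is a made-up name):

```shell
# Read `smartctl -A` output on stdin and, for attribute 190, print the
# normalized value, the raw temperature, and 100 minus the raw temperature.
airflow_temp_check() {
    awk '$2 == "Airflow_Temperature_Cel" {
             printf "value=%d raw=%d 100-raw=%d\n", $4, $10, 100 - $10
         }'
}
```

For example: smartctl -A /dev/sda | airflow_temp_check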

Ryan
--

From: James
Date: Thursday, December 30, 2010 - 11:03 am

Fair enough. :) Thanks for the response.

So the big question (to all) becomes this: is this a hard drive issue,
or a motherboard / SATA controller issue? Either one would suck, but
hard drives are obviously easier to swap than a motherboard.

Thoughts on how to go about diagnosing the issue further to determine
what is going on would be greatly appreciated. Aside from replacing
all the drives and hoping for the best, I don't see an easy way to
really figure out what is causing the I/O errors that are resulting in
bad sectors.

-james

--

From: Neil Brown
Date: Thursday, December 30, 2010 - 2:15 am

When md/raid6 tries to read from a device and gets a read error, it tries to
read from the other devices.  When that succeeds, it computes the data that
it had tried to read and then writes it back to the original drive.  If this
succeeds, it assumes that the read error has been corrected by the write, and

A few occasional messages like this are fairly benign.  They could be a sign
that the drive surface is degrading.  If you see lots of these messages, then
you should seriously consider replacing the drive.
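
The recompute step can be illustrated with plain XOR parity. This is only a toy sketch of the idea, not md's code: blocks are modelled as small integers, the values are arbitrary, and RAID 6 additionally keeps a second syndrome (Q) that is not shown here.

```shell
# XOR all arguments together and print the result.
xor_all() {
    acc=0
    for v in "$@"; do
        acc=$(( acc ^ v ))
    done
    echo "$acc"
}

d0=165 d1=60 d2=15                     # three data "blocks" (arbitrary values)
p=$(xor_all "$d0" "$d1" "$d2")         # parity stored alongside them
rebuilt=$(xor_all "$d0" "$d2" "$p")    # d1 is unreadable: recompute it
echo "original d1=$d1 rebuilt d1=$rebuilt"
```

The rebuilt value is what gets written back to the drive that returned the read error.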

As you are seeing these messages across all devices, it is possible that the
problem is with the SATA controller rather than the disks.  To know which, you
should check the errors that are reported in dmesg.  If you don't understand
these messages, then post them to the list - feel free to post several hundred
lines of logs - too much is much, much better than not enough.


--

From: James
Date: Thursday, December 30, 2010 - 9:35 am

Sorry Neil, I meant to reply-all.

-james

--

From: Neil Brown
Date: Thursday, December 30, 2010 - 4:12 pm

"Unhandled error code" sounds like it could be a driver problem...

Try googling that error message...

http://us.generation-nt.com/answer/2-6-33-libata-issues-via-sata-pata-controller-help-...


"Also, please try the latest 2.6.34-rc kernel, as that has several fixes
for both pata_via and sata_via which did not make 2.6.33."

What kernel are you running???


--

From: James
Date: Thursday, December 30, 2010 - 6:48 pm

Neil,

I'm running 2.6.35.

Although an expensive route, the only thing I can think to do to
determine 100% whether the issue is software or hardware (and, if
hardware, whether SATA controller or the drives) is to swap the drives
out.

Ouch!

Any other ideas, however, would be appreciated before I drop a few
hundred bucks. :)

-james

--

From: Guy Watkins
Date: Thursday, December 30, 2010 - 6:56 pm

} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} owner@vger.kernel.org] On Behalf Of James
} Sent: Thursday, December 30, 2010 8:48 PM
} To: Neil Brown
} Cc: linux-raid@vger.kernel.org
} Subject: Re: read errors corrected
} 
} Neil,
} 
} I'm running 2.6.35.
} 
} Although an expensive route, the only thing I can think to do to
} determine 100% whether the issue is software or hardware (and, if
} hardware, whether SATA controller or the drives) is to swap the drives
} out.
} 
} Ouch!
} 
} Any other ideas, however, would be appreciated before I drop a few
} hundred bucks. :)

Just swap out 1 for now?  :)

I believe your drives are fine because your SMART stats don't reflect the
number of errors you see in the logs.

} 
} -james
} 
} On Thu, Dec 30, 2010 at 23:12, Neil Brown <neilb@suse.de> wrote:
} > On Thu, 30 Dec 2010 11:35:59 -0500 James <jtp@nc.rr.com> wrote:
} >
} >> Sorry Neil, I meant to reply-all.
} >>
} >> -james
} >>
} >> On Thu, Dec 30, 2010 at 11:35, James <jtp@nc.rr.com> wrote:
} >> > Inline.
} >> >
} >> > On Thu, Dec 30, 2010 at 04:15, Neil Brown <neilb@suse.de> wrote:
} >> >> On Thu, 30 Dec 2010 03:20:48 +0000 James <jtp@nc.rr.com> wrote:
} >> >>
} >> >>> All,
} >> >>>
} >> >>> I'm looking for a bit of guidance here. I have a RAID 6 set up on
} my
} >> >>> system and am seeing some errors in my logs as follows:
} >> >>>
} >> >>> # cat messages | grep "read erro"
} >> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
} >> >>> sectors at 974262528 on sda4)
} >> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
} >> >>> sectors at 974262536 on sda4)
} >> >> .....
} >> >>
} >> >>>
} >> >>> I've Google'd the heck out of this error message but am not seeing
} a
} >> >>> clear and concise message: is this benign? What would cause these
} >> >>> errors? Should I be concerned?
} >> >>>
} >> >>> There is an error message (read error corrected) on each of the
} ...
From: Neil Brown
Date: Thursday, December 30, 2010 - 7:08 pm

Buy a PCIe SATA controller, plug it in and move some/all drives over to that?
Should be a lot less than $100.  Make sure it is a different chipset to what
you have on your motherboard.
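
To see which controller each drive currently hangs off (and later to confirm that the moved drives really sit on the new card), the sysfs path of a block device includes the PCI address of its controller. A rough sketch, assuming the usual /sys/devices/pci... layout; pci_addr_of is a made-up helper name:

```shell
# Print the PCI address of the controller behind a block device.
# $1: resolved sysfs path, e.g. the output of: readlink -f /sys/block/sda
pci_addr_of() {
    # The last domain:bus:device.function component in the path is the
    # controller itself; earlier ones are bridges.
    printf '%s\n' "$1" |
        grep -o '[0-9a-f]\{4\}:[0-9a-f]\{2\}:[0-9a-f]\{2\}\.[0-7]' |
        tail -n 1
}
```

For example: pci_addr_of "$(readlink -f /sys/block/sda)" prints an address that can be passed to lspci -s to get the chipset name.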

NeilBrown
--

From: Giovanni Tessore
Date: Thursday, December 30, 2010 - 3:13 am

(a) These errors usually come from defective disk sectors. md
reconstructs the missing sector from the parity and data on the other
disks in the array, then rewrites the sector on the defective disk; if
the sector is rewritten without error (maybe the drive remaps the sector
into its reserved area), then only the log message is displayed.

(b) With RAID 6 it's almost benign; to get into trouble you would need a
read error on the same sector on more than 2 disks; or have 2 disks
failed and out of the array and get a read error on one of the other
disks while reconstructing the array; or have 1 disk failed and get a
read error on the same sector on more than 1 disk while reconstructing.
(With RAID 5 it's rather dangerous instead, as you can have big trouble
if a disk fails and you get a read error on another disk while
reconstructing; that happened to me!)

(c) No; it's also a good rule to perform a periodic scrub of the array
(a check of the array), to reveal and correct defective sectors.

(d) Check the SMART status of the disks for the "reallocated sectors
count"; also, if the md superblock version is >= 1, there is a persistent
count of corrected read errors for each device in
/sys/block/mdXX/md/dev-XX/errors. When this counter reaches 256 the disk
is marked failed; IMHO when a disk is giving even a few corrected read
errors in a short interval it's better to replace it.
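
Points (c) and (d) can be sketched with the md sysfs files. The helper names and the MD_SYSFS variable are made up for this example (the variable only exists so the functions can be pointed at a test directory); the sync_action and dev-*/errors files are the standard md sysfs entries.

```shell
# Base directory of the block-device sysfs tree (override for testing).
MD_SYSFS=${MD_SYSFS:-/sys/block}

# Print "<member> <corrected read errors>" for each member of an array.
# $1: array name, e.g. md4
md_member_errors() {
    for f in "$MD_SYSFS/$1"/md/dev-*/errors; do
        [ -r "$f" ] || continue
        dev=${f%/errors}        # strip the trailing file name
        dev=${dev##*/dev-}      # keep only the member name, e.g. sda4
        echo "$dev $(cat "$f")"
    done
}

# Start a scrub ("check") of the array; needs root on a real system.
# $1: array name, e.g. md4
md_start_check() {
    echo check > "$MD_SYSFS/$1/md/sync_action"
}
```

For example: md_start_check md4, wait for the check to finish (progress shows in /proc/mdstat), then run md_member_errors md4 and look at /sys/block/md4/md/mismatch_cnt.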

-- 
Yours faithfully.

Giovanni Tessore


--
