All,

I'm looking for a bit of guidance here. I have a RAID 6 set up on my system and am seeing some errors in my logs as follows:

# cat messages | grep "read erro"
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262528 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262536 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262544 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262552 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262560 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262568 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262576 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262584 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262592 on sda4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923648 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923656 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923664 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923672 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923680 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923688 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923696 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923520 on sdc4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923528 on sdc4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923536
...
You should look into the SMART information on the drives using smartctl. -- Mikael Abrahamsson email: swmike@swm.pp.se --
Lots of tremendous responses. I appreciate it. I'm going to reply to the first person who responded here, but this email should cover some of the questions posed in further responses.

Here are some other logs that may be relevant:

Dec 15 15:40:34 nuova kernel: sd 0:0:0:0: [sda] Unhandled error code
Dec 15 15:40:34 nuova kernel: sd 0:0:0:0: [sda] Result: hostbyte=0x00 driverbyte=0x06
Dec 15 15:40:34 nuova kernel: sd 0:0:0:0: [sda] CDB: cdb[0]=0x28: 28 00 3b e3 53 ea 00 00 48 00
Dec 15 15:40:34 nuova kernel: end_request: I/O error, dev sda, sector 1004753898
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262528 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262536 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262544 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262552 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262560 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262568 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262576 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262584 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262592 on sda4)

Unfortunately I had not caught those error messages at first glance...I/O error? Hrmm...doesn't sound good.

The issue is repeated later on.

Dec 29 03:04:01 nuova kernel: sd 1:0:1:0: [sdd] Unhandled error code
Dec 29 03:04:01 nuova kernel: sd 0:0:1:0: [sdb] Unhandled error code
Dec 29 03:04:01 nuova kernel: sd 0:0:1:0: [sdb] Result: hostbyte=0x00 driverbyte=0x06
Dec 29 03:04:01 nuova kernel: sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 1b 06 d2 ea 00 00 78 00
Dec 29 03:04:01 nuova kernel: end_request: I/O error, dev sdb, sector 453432042
Dec 29 03:04:01 ...
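As a cross-check on the logs above: opcode 0x28 is SCSI READ(10), whose CDB carries a 32-bit big-endian LBA in bytes 2-5 and a 16-bit transfer length in bytes 7-8. The following is only a minimal Python sketch (the function name decode_read10 is made up for illustration) that decodes the two CDBs quoted in the logs.

```python
def decode_read10(cdb_hex: str) -> dict:
    """Decode a 10-byte SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    return {
        # Bytes 2..5: starting logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7..8: number of logical blocks requested, big-endian.
        "sectors": int.from_bytes(cdb[7:9], "big"),
    }


# CDBs copied from the kernel log lines quoted in this message.
for dev, cdb in [
    ("sda", "28 00 3b e3 53 ea 00 00 48 00"),
    ("sdb", "28 00 1b 06 d2 ea 00 00 78 00"),
]:
    info = decode_read10(cdb)
    print(dev, "LBA", info["lba"], "length", info["sectors"], "sectors")
```

For the sda entry this gives LBA 1004753898, the same sector named in the end_request line, and a length of 72 sectors. Assuming 512-byte logical blocks, that is nine 8-sector chunks, which lines up with the nine "read error corrected" messages logged for sda4 in the same second. In other words, one failed read request appears to account for the whole burst of md messages.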
On Thu, 30 Dec 2010 11:33:31 -0500 If your drives run at 68 degrees Celsius, you should emergency-cut the power ASAP and perhaps reach for the nearest fire extinguisher. -- With respect, Roman
Agreed. ;) That's why I posted those messages -- I'm unsure why it would change those values.

Here's what smartctl shows for all of the drives:

~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep Temperature ; done
190 Airflow_Temperature_Cel 0x0022 069 059 045 Old_age Always - 31 (Lifetime Min/Max 23/37)
194 Temperature_Celsius     0x0022 031 041 000 Old_age Always - 31 (0 23 0 0)
190 Airflow_Temperature_Cel 0x0022 068 058 045 Old_age Always - 32 (Lifetime Min/Max 22/38)
194 Temperature_Celsius     0x0022 032 042 000 Old_age Always - 32 (0 22 0 0)
190 Airflow_Temperature_Cel 0x0022 068 057 045 Old_age Always - 32 (Lifetime Min/Max 22/38)
194 Temperature_Celsius     0x0022 032 043 000 Old_age Always - 32 (0 22 0 0)
190 Airflow_Temperature_Cel 0x0022 069 059 045 Old_age Always - 31 (Lifetime Min/Max 23/37)
194 Temperature_Celsius     0x0022 031 041 000 Old_age Always - 31 (0 23 0 0)

Those values seem appropriate, particularly since the "max" is 37 (as --
Not sure why the log is showing the weird C temp. The output from smartctl looks correct. The max is not defined by the manufacturer; it is the maximum temperature the drive has reached. Ryan --
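One possible explanation for the odd "68" figure, offered as a guess rather than a diagnosis: in the smartctl rows posted above, the normalized VALUE column of attribute 190 (069, 068) is not a temperature, while the RAW_VALUE column (31, 32) is. In this data the normalized value happens to equal 100 minus the raw temperature. A minimal Python sketch (parse_smart_attribute is a hypothetical helper, and the column layout is assumed to match the rows above) separates the two:

```python
def parse_smart_attribute(line: str) -> dict:
    """Split one 'smartctl -a' attribute row into its columns.

    Assumed column order, as in the rows quoted in this thread:
    ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    fields = line.split(None, 9)  # RAW_VALUE may itself contain spaces
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),    # normalized, vendor-defined scale
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": fields[9].strip(),   # for these attributes starts with degrees C
    }


row = "190 Airflow_Temperature_Cel 0x0022 069 059 045 Old_age Always - 31 (Lifetime Min/Max 23/37)"
attr = parse_smart_attribute(row)
raw_celsius = int(attr["raw"].split()[0])
print(attr["name"], "normalized:", attr["value"], "raw Celsius:", raw_celsius)
```

If a log line reports this attribute changing to 68 or 69, it is presumably quoting the normalized column, not the actual temperature of 31-32 degrees.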
Fair enough. :) Thanks for the response. So the big question (to all) becomes this: is this a hard drive issue, or a motherboard / SATA controller issue? Either one would suck, but hard drives are obviously easier to swap than a motherboard. Thoughts on how to go about diagnosing the issue further to determine what is going on would be greatly appreciated. Aside from replacing all the drives and hoping for the best, I don't see an easy way to really figure out what is causing the I/O errors that are resulting in bad sectors. -james --
When md/raid6 tries to read from a device and gets a read error, it tries to read from the other devices. When that succeeds, it computes the data that it had tried to read and then writes it back to the original drive. If this succeeds, it assumes that the read error has been corrected by a write, and logs the "read error corrected" message. A few occasional messages like this are fairly benign. They could be a sign that the drive surface is degrading. If you see lots of these messages, then you should seriously consider replacing the drive. As you are seeing these messages across all devices, it is possible that the problem is with the SATA controller rather than the disks. To know which, you should check the errors that are reported in dmesg. If you don't understand these messages, then post them to the list - feel free to post several hundred lines of logs - too much is much much better than not enough. --
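To judge whether these messages are "occasional" or "lots", and whether they cluster on one member or spread across all of them, it can help to tally them per device. A minimal Python sketch, assuming the message format shown in the logs earlier in this thread (the function name tally_corrected is made up for illustration):

```python
import re
from collections import Counter

# Matches e.g. "md/raid:md4: read error corrected (8 sectors at 974262528 on sda4)"
PATTERN = re.compile(
    r"md/raid:(\w+): read error corrected \((\d+) sectors at (\d+) on (\w+)\)"
)


def tally_corrected(lines):
    """Return a Counter mapping (array, member) to total corrected sectors."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            array, sectors, _start, member = match.groups()
            counts[(array, member)] += int(sectors)
    return counts


sample = [
    "Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262528 on sda4)",
    "Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262536 on sda4)",
    "Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923648 on sdb4)",
]
for (array, member), sectors in sorted(tally_corrected(sample).items()):
    print(array, member, sectors, "sectors corrected")
```

In practice the lines would come from the system log file rather than a hard-coded list.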
"Unhandled error code" sounds like it could be a driver problem... Try googling that error message... http://us.generation-nt.com/answer/2-6-33-libata-issues-via-sata-pata-controller-help-... "Also, please try the latest 2.6.34-rc kernel, as that has several fixes for both pata_via and sata_via which did not make 2.6.33." What kernel are you running??? --
Neil, I'm running 2.6.35. Although an expensive route, the only thing I can think to do to determine 100% whether the issue is software or hardware (and, if hardware, whether SATA controller or the drives) is to swap the drives out. Ouch! Any other ideas, however, would be appreciated before I drop a few hundred bucks. :) -james --
} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of James
} Sent: Thursday, December 30, 2010 8:48 PM
} To: Neil Brown
} Cc: linux-raid@vger.kernel.org
} Subject: Re: read errors corrected
}
} Neil,
}
} I'm running 2.6.35.
}
} Although an expensive route, the only thing I can think to do to
} determine 100% whether the issue is software or hardware (and, if
} hardware, whether SATA controller or the drives) is to swap the drives
} out.
}
} Ouch!
}
} Any other ideas, however, would be appreciated before I drop a few
} hundred bucks. :)

Just swap out 1 for now? :)

I believe your drives are fine because your SMART stats don't reflect the number of errors you see in the logs.

}
} -james
}
} On Thu, Dec 30, 2010 at 23:12, Neil Brown <neilb@suse.de> wrote:
} > On Thu, 30 Dec 2010 11:35:59 -0500 James <jtp@nc.rr.com> wrote:
} >
} >> Sorry Neil, I meant to reply-all.
} >>
} >> -james
} >>
} >> On Thu, Dec 30, 2010 at 11:35, James <jtp@nc.rr.com> wrote:
} >> > Inline.
} >> >
} >> > On Thu, Dec 30, 2010 at 04:15, Neil Brown <neilb@suse.de> wrote:
} >> >> On Thu, 30 Dec 2010 03:20:48 +0000 James <jtp@nc.rr.com> wrote:
} >> >>
} >> >>> All,
} >> >>>
} >> >>> I'm looking for a bit of guidance here. I have a RAID 6 set up on my
} >> >>> system and am seeing some errors in my logs as follows:
} >> >>>
} >> >>> # cat messages | grep "read erro"
} >> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
} >> >>> sectors at 974262528 on sda4)
} >> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
} >> >>> sectors at 974262536 on sda4)
} >> >> .....
} >> >>
} >> >>>
} >> >>> I've Google'd the heck out of this error message but am not seeing a
} >> >>> clear and concise message: is this benign? What would cause these
} >> >>> errors? Should I be concerned?
} >> >>>
} >> >>> There is an error message (read error corrected) on each of the
} ...
Buy a PCIe SATA controller, plug it in and move some/all drives over to that? Should be a lot less than $100. Make sure it is a different chipset to what you have on your motherboard. NeilBrown --
(a) These errors usually come from defective disk sectors. RAID reconstructs the missing sector from parity on the other disks in the array, then rewrites the sector on the defective disk; if the sector is rewritten without error (maybe the drive remaps the sector into its reserved area), then only the log message is displayed.

(b) With RAID 6 it's almost benign; to get into trouble you would need a read error on the same sector on more than 2 disks; or have 2 disks failed and out of the array and get a read error on one of the other disks while reconstructing the array; or have 1 disk failed and get a read error on the same sector on more than 1 disk while reconstructing. (With RAID 5 it's much more dangerous, as you can have big trouble if a disk fails and you get a read error on another disk while reconstructing; that happened to me!)

(c) No; it's also a good rule to perform a periodic scrub of the array (a check of the array), to reveal and correct defective sectors.

(d) Check the SMART status of the disks for the "reallocated sector count"; also, if the md superblock version is >= 1, there is a persistent count of corrected read errors for each device in /sys/block/mdXX/md/dev-XX/errors; when this counter reaches 256 the disk is marked failed. IMHO, when a disk gives even a few corrected read errors in a short interval, it's better to replace it.

-- Yours faithfully. Giovanni Tessore --
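The per-device counter mentioned in (d) can be read directly from sysfs. A minimal Python sketch, assuming the /sys/block/mdXX/md/dev-XX/errors layout described above (the function name corrected_error_counts and its sysfs_root parameter are made up for illustration; "md4" is the array name from the logs in this thread):

```python
from pathlib import Path


def corrected_error_counts(array: str, sysfs_root: str = "/sys/block") -> dict:
    """Return {member: corrected read error count} for one md array.

    Reads <sysfs_root>/<array>/md/dev-<member>/errors for every member.
    Returns an empty dict if the array directory does not exist.
    """
    md_dir = Path(sysfs_root) / array / "md"
    if not md_dir.is_dir():
        return {}
    counts = {}
    for errors_file in sorted(md_dir.glob("dev-*/errors")):
        member = errors_file.parent.name[len("dev-"):]
        counts[member] = int(errors_file.read_text().strip())
    return counts


if __name__ == "__main__":
    for member, count in corrected_error_counts("md4").items():
        print(member, count)
```

The periodic scrub from (c) is normally started by writing the word check to /sys/block/mdXX/md/sync_action as root; reading the counters again after the scrub finishes shows which members needed corrections.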
