Re: How to fix Current_Pending_Sector?

Previous thread: activating spares by Keld Simonsen on Wednesday, March 10, 2010 - 12:43 am. (2 messages)

Next thread: Creating RAID5 w/ Hotspare? by Carlos Mennens on Thursday, March 11, 2010 - 7:55 am. (2 messages)
From: Iain Rauch
Date: Thursday, March 11, 2010 - 4:51 am

Smartd emailed me to say I have "1 Currently unreadable (pending) sectors".
This actually happened for two disks now.

I ran a check and then a repair on my array and they both gave mismatch_cnt
of 8.
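
For reference, a check or repair and the resulting count can also be driven through md's sysfs files. The following is only a minimal sketch: it assumes the array is md1 (the name that appears in the kernel log later in this thread) and that the standard `sync_action` and `mismatch_cnt` files are present; the helper names and the `sysfs_root` parameter are made up for illustration.

```python
from pathlib import Path


def md_dir(array: str, sysfs_root: str = "/sys") -> Path:
    """Return the md control directory for an array name such as 'md1'."""
    return Path(sysfs_root) / "block" / array / "md"


def start_scrub(array: str, action: str = "check", sysfs_root: str = "/sys") -> None:
    """Ask md to start a 'check' (read-only) or 'repair' pass. Needs root."""
    if action not in ("check", "repair"):
        raise ValueError("action must be 'check' or 'repair'")
    (md_dir(array, sysfs_root) / "sync_action").write_text(action + "\n")


def mismatch_count(array: str, sysfs_root: str = "/sys") -> int:
    """Read the mismatch count reported by the most recent check/repair."""
    return int((md_dir(array, sysfs_root) / "mismatch_cnt").read_text().strip())


if __name__ == "__main__":
    # 'md1' is an assumption taken from the log output in this thread.
    print("mismatch_cnt:", mismatch_count("md1"))
```

The `sysfs_root` argument exists only so the helpers can be pointed at a scratch directory when trying them out without touching a live array.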

I ran a long self-test on both disks; each completed without error and
logged nothing. Yet 'Current_Pending_Sector' is still 1 on both, and one
disk also has a 'UDMA_CRC_Error_Count' of 1.

I ran 'hdrecover' on both and they are both telling me "Couldn't recover
sector 2930277168". It's asking if I want to overwrite it with zeros to fix
it, but I would assume this will damage my array?

The disk sizes are 1500301910016 bytes and I use 1500250M partition sizes
for the array components. Does that sector fall outside my partition, and
hence would it be safe to overwrite it with zeros?
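
The arithmetic behind that question can be checked directly. This is a rough sketch under stated assumptions: 512-byte sectors, "M" meaning 10^6 bytes, and a partition starting at sector 63 (a hypothetical value, since the actual partition table is not shown here).

```python
SECTOR_BYTES = 512                    # assumption: 512-byte logical sectors
DISK_BYTES = 1500301910016            # disk size quoted above
BAD_SECTOR = 2930277168               # sector reported by hdrecover
PARTITION_BYTES = 1500250 * 10**6     # assumption: "M" means 10^6 bytes
PARTITION_START_SECTOR = 63           # hypothetical: real start not given

total_sectors = DISK_BYTES // SECTOR_BYTES
bad_offset = BAD_SECTOR * SECTOR_BYTES
partition_end = PARTITION_START_SECTOR * SECTOR_BYTES + PARTITION_BYTES

# If "M" meant MiB, the partition would be larger than the disk itself,
# which is why the decimal reading is assumed here.
mib_reading_fits = 1500250 * 2**20 <= DISK_BYTES

print("total sectors on disk :", total_sectors)
print("byte offset of sector :", bad_offset)
print("partition ends at byte:", partition_end)
print("sector beyond partition:", bad_offset >= partition_end)
print("sector within disk     :", BAD_SECTOR < total_sectors)
print("MiB reading plausible  :", mib_reading_fits)
```

Under these assumptions the reported sector number is exactly equal to the disk's total sector count, so if hdrecover counts from zero it is pointing one sector past the last addressable one, well beyond the end of the partition. Whether hdrecover numbers sectors that way is itself an assumption.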

Also, why did I have a mismatch_cnt? I haven't run another check since I did
the repair, as I wanted to fix the pending sector.

BTW, I have a 15 drive RAID6.

Hope y'all can help.


Iain


--

From: Michael Evans
Date: Thursday, March 11, 2010 - 5:06 am

On Thu, Mar 11, 2010 at 3:51 AM, Iain Rauch

If you are running RAID6 and it can read from all but two drives, then
it should still be able to reconstruct the data for those two drives
from the remaining (presumed good) reads.  RECENT kernels will try to
rewrite failed sectors automatically, and only kick the drive if the
write fails.

Please provide more information.

Kernel version
mdadm version

Information about how the source block devices are split up before
mdadm sees them, and any related messages from the system-log.  The
relevant section should be near the end of a dmesg output when you've
just completed a check or repair.  Your syslog probably already
captured the same data and stored it elsewhere.
--

From: Iain Rauch
Date: Thursday, March 11, 2010 - 5:25 am

I thought doing the repair was supposed to fix the issue, but it didn't seem
to touch it. I wonder if it is outside what md sees, but then how would it
have been noticed as unreadable? And is it coincidence that both drives have
the same unreadable sector?

root@Edna:/home/iain# uname -a
Linux Edna 2.6.28-16-server #57-Ubuntu SMP Wed Nov 11 10:34:04 UTC 2009
x86_64 GNU/Linux
root@Edna:/home/iain# mdadm -V
mdadm - v2.6.9 - 10th March 2009

I paste the end of messages below. There's loads of that all the way through
doing the repair so I'm not sure how to filter out the useful bits.


Iain


Mar 10 07:21:21 Edna -- MARK --
Mar 10 07:29:48 Edna kernel: [135073.510019] Modules linked in: appletalk
video output input_polldev nfsd auth_rpcgss exportfs nfs lockd nfs_acl
sunrpc xfs bonding lp ppdev psmouse pcspkr k8temp serio_raw i2c_piix4 r8168
snd_hda_intel snd_pcm snd_timer snd soundcore snd_page_alloc parport_pc
parport shpchp ohci1394 ieee1394 sata_mv raid10 raid456 async_xor
async_memcpy async_tx xor raid1 raid0 multipath linear fbcon tileblit font
bitblit softcursor
Mar 10 07:29:48 Edna kernel: [135073.510019] CPU 0:
Mar 10 07:29:48 Edna kernel: [135073.510019] Modules linked in: appletalk
video output input_polldev nfsd auth_rpcgss exportfs nfs lockd nfs_acl
sunrpc xfs bonding lp ppdev psmouse pcspkr k8temp serio_raw i2c_piix4 r8168
snd_hda_intel snd_pcm snd_timer snd soundcore snd_page_alloc parport_pc
parport shpchp ohci1394 ieee1394 sata_mv raid10 raid456 async_xor
async_memcpy async_tx xor raid1 raid0 multipath linear fbcon tileblit font
bitblit softcursor
Mar 10 07:29:48 Edna kernel: [135073.510019] Pid: 1005, comm: md1_raid5 Not
tainted 2.6.28-16-server #57-Ubuntu
Mar 10 07:29:48 Edna kernel: [135073.510019] RIP: 0010:[<ffffffffa007f7c9>]
[<ffffffffa007f7c9>] raid6_sse24_gen_syndrome+0x1e9/0x28a [raid456]
Mar 10 07:29:48 Edna kernel: [135073.510019] RSP: 0018:ffff88012bd0db58
EFLAGS: 00000297
Mar 10 07:29:48 Edna kernel: [135073.510019] RAX: ffff8800ac397000 ...
--

From: Stefan /*St0fF*/ Hübner
Date: Thursday, March 11, 2010 - 9:54 am

Hi Iain,

The 'Current_Pending_Sector' value is a SMART attribute which gets
incremented during online operation (reading and writing sectors) AND
during offline drive scanning (also called SMART Data Collection), when
the drive finds that a sector cannot be read correctly, either at the
first try (offline data collection) or after applying various
error-correction techniques.

The easiest way to get rid of this problem: dd a sector of zeros onto
the broken sector, then fail the drive and re-add it.  Now wait until
the resync is done.

What I'm not sure about is whether one should fail and re-add both
drives at once, as that would lose the redundancy...
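
As a rough illustration of those steps, the sketch below only builds and prints the command lines; it does not run anything. The device names are placeholders, the helper name is made up, and taking the member out of the array before writing to it is an assumption about the safer ordering rather than something confirmed here.

```python
def sector_rewrite_plan(array, member, device, sector, sector_bytes=512):
    """Return the commands for zeroing one sector and re-adding the member.

    array  -- md device, e.g. /dev/md1
    member -- array component, e.g. a partition (placeholder name)
    device -- block device the sector number refers to (placeholder name)
    sector -- sector index on `device`, in units of `sector_bytes`
    """
    return [
        ["mdadm", array, "--fail", member],
        ["mdadm", array, "--remove", member],
        # Write exactly one sector of zeros at the given position.
        ["dd", "if=/dev/zero", f"of={device}", f"bs={sector_bytes}",
         f"seek={sector}", "count=1", "oflag=direct"],
        ["mdadm", array, "--add", member],
    ]


if __name__ == "__main__":
    # /dev/sdX and /dev/sdX1 are placeholders, not real device names.
    for cmd in sector_rewrite_plan("/dev/md1", "/dev/sdX1", "/dev/sdX", 2930277168):
        print(" ".join(cmd))
```

Printing rather than executing is deliberate: a wrong device name or sector number in the dd step destroys data, so each line should be reviewed by hand, and only one drive should be handled at a time.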

Speaking about redundancy: our rule of thumb (at xtivate.de) is "each 4
drives need one redundancy" - so a redundancy of 2 with 15 drives is
kind of playing with your luck...

Good luck,
Stefan
--

From: Iain Rauch
Date: Monday, March 15, 2010 - 4:20 am

Well, I failed one of the drives and allowed 'hdrecover' to overwrite the
unreadable sector, but it still couldn't fix it. Here's its report:

Wiping sector 2930277168...
Checking sector is now readable...
I still couldn't read the sector!
I'm sorry, but even writing to the sector hasn't fixed it - there's nothing
more I can do!
Summary:
  1 bad sectors found
  of those 0 were recovered
  and 1 could not be recovered and were destroyed causing data loss

The 'Current_Pending_Sector' was still 1, so I dd'd zeros onto the whole
drive. I guess I could have just done part of it, but I suppose that
verified the whole drive 'works'. It only took ~5 hours. Funnily enough,
this did bring the 'Current_Pending_Sector' count back to zero. Still no
error reports in the SMART data, and 'Reallocated_Event_Count' didn't go
up - shouldn't that have gone up to one?
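
For comparing these counters before and after a rewrite, something like the following could be used. It is a sketch only: it assumes the usual ten-column attribute table printed by `smartctl -A`, with the raw value in the last column, and the function names are made up for illustration.

```python
import subprocess

# Attribute names of interest in this thread.
WATCHED = (
    "Current_Pending_Sector",
    "Reallocated_Event_Count",
    "UDMA_CRC_Error_Count",
)


def parse_smart_attributes(text):
    """Map attribute name -> raw value from `smartctl -A` style output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def read_smart_attributes(device):
    """Run smartctl on a device (needs root) and parse its attribute table."""
    # smartctl uses its exit status as a bit mask, so a non-zero status
    # is not treated as a failure here.
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_smart_attributes(result.stdout)


def watched_counters(attrs):
    """Pick out the counters discussed above; missing ones map to None."""
    return {name: attrs.get(name) for name in WATCHED}
```

Calling `watched_counters(read_smart_attributes(dev))` before and after the dd, with `dev` set to the drive in question, gives two dictionaries that can be compared directly.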

I re-partitioned and added it to the array and it rebuilt fine in ~12 hours.

Repeated the process with the second drive and everything's back to normal.

The drive that had the 'UDMA_CRC_Error_Count' still says 1, but I don't
think I need to worry about that?

In direct reply to Stefan:

I think you meant to dd zeros onto the drive /after/ failing it - would have
caused corruption otherwise?

I definitely think it made sense to do one at a time.

One parity drive for every four seems a bit extreme, especially when you
have a backup (which I don't). I'm fairly happy with 15 drives in RAID 6. I
had 24 drives before, and that did give me a few problems :p Just need to
keep the drives healthy. (Array scrubs, SMART tests etc).


Iain


--

From: CoolCold
Date: Thursday, March 18, 2010 - 10:35 am

I had a similar issue - there were 5 Currently unreadable (pending)
sectors and 1 Offline uncorrectable sector, then the drive was kicked out
of the raid, but re-adding the drive helped - that bad sector is gone. Now
there are 2 pending and 1 uncorrectable, so I'm going to fix those two.
My question is - is there any way to resync the array faster? Say I
update bitmaps from the current 0.9, fail the drive, do dd on the sectors,
and add the drive: will the bitmap help to resync not the whole drive, but
just the parts which have changed?



-- 
Best regards,
[COOLCOLD-RIPN]
--

From: David Rees
Date: Thursday, March 18, 2010 - 12:37 pm

On Mon, Mar 15, 2010 at 4:20 AM, Iain Rauch

No - the drive was able to successfully write to the sector it was
unable to read from.  If the write had failed, it would have
reallocated the sector.

-Dave
--

From: Greg Freemyer
Date: Thursday, March 18, 2010 - 2:47 pm

Dave,

Most sector writes are blind (i.e. non-verified).

Is your theory that if the sector is marked as a Pending_Bad_Sector a
write is done, but it is verified, and a reallocate only occurs if the
verify fails?

I've never heard that theory, but it makes great sense.

Greg
--

From: Stefan /*St0fF*/ Hübner
Date: Friday, March 19, 2010 - 12:22 am

Hi Greg,



If the drive has noted erroneous behaviour on a sector (i.e. marked it
pending), it will try to resolve the problem by verifying.  It just only


Stefan
--
