recovering from a controller failure

From: Kyler Laird
Date: Saturday, May 29, 2010 - 12:07 pm

Recently a drive failed on one of our file servers.  The machine has
three RAID6 arrays (15 1TB drives each, plus spares).  I let the spare rebuild
and then started the process of replacing the drive.

Unfortunately I'd misplaced the list of drive IDs so I generated a new
list in order to identify the failed drive.  I used "smartctl" and made
a quick script to scan all 48 drives and generate pretty output.  That
was a mistake.  After I ran it a couple of times, one of the controllers
failed and several disks in the first array were marked as failed.
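
For reference, a drive ID list can sometimes be rebuilt without sending
any commands to the drives.  The sketch below is not the script
described above; it assumes a Linux system where udev populates
/dev/disk/by-id with symlinks named after each drive's model and serial
number.  The function name drive_ids is hypothetical, and whether this
avoids the controller problem on this particular hardware is untested.

```python
import os


def drive_ids(by_id_dir="/dev/disk/by-id"):
    """Map kernel device names (e.g. "sda") to their by-id link names.

    Only symlinks are read; no SMART or other commands are issued.
    Assumes udev has populated by_id_dir (an assumption, not verified).
    """
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        path = os.path.join(by_id_dir, name)
        if not os.path.islink(path):
            continue
        # Skip partition links such as "...-part1"; whole drives only.
        if "-part" in name:
            continue
        # Links look like "../../sda"; keep just the device name.
        device = os.path.basename(os.path.realpath(path))
        mapping.setdefault(device, []).append(name)
    return mapping
```

Calling drive_ids() with no arguments reads the live system directory;
passing another directory makes it easy to try against a saved copy.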

I worked on the machine for a while.  (It has an NFS root.)  I got some
information from it before it rebooted (via watchdog).  I've dumped all
of the information here.
	http://lairds.us/temp/ucmeng_md/

In mdstat_0 you can see the status of the arrays right after the
controller failure.  mdstat_1 shows the status after reboot.

sys_block shows a listing of the block devices so you can see that the
problem drives are on controller 1.

The examine_sd?1 files show mdadm -E output from each drive in md0.  Note that
the Events count is different for the drives on the problem controller.
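
The Events comparison can be done mechanically.  The following is a
minimal sketch that assumes each saved file contains a line of the form
"Events : <value>", as mdadm -E normally prints; the function names and
the sample values in the usage note are made up for illustration and
are not taken from the files linked above.

```python
import re
from collections import defaultdict

# Matches the "Events : <value>" line in mdadm -E (examine) output.
EVENTS_RE = re.compile(r"^\s*Events\s*:\s*(\S+)", re.MULTILINE)


def events_count(examine_text):
    """Return the Events value as a string, or None if no such line."""
    match = EVENTS_RE.search(examine_text)
    return match.group(1) if match else None


def group_by_events(outputs):
    """Group device names by Events value.

    outputs maps a device name to the text of its examine output.
    Returns a dict from Events value to a sorted list of device names.
    """
    groups = defaultdict(list)
    for device, text in outputs.items():
        groups[events_count(text)].append(device)
    return {value: sorted(devices) for value, devices in groups.items()}
```

Given the text of each examine_sd?1 file keyed by device name, drives
that dropped out together should land in the same group, separate from
the drives that stayed in the array.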

I'd like to know if this is something I can recover.  I do have backups
but it's a huge pain to recover this much data.

Thank you.

--kyler

Messages in current thread:
recovering from a controller failure, Kyler Laird, (Sat May 29, 12:07 pm)
Re: recovering from a controller failure, Berkey B Walker, (Sat May 29, 12:46 pm)
Re: recovering from a controller failure, Kyler Laird, (Sat May 29, 1:44 pm)
Re: recovering from a controller failure, Richard, (Sat May 29, 2:18 pm)
Re: recovering from a controller failure, Kyler Laird, (Sat May 29, 2:36 pm)
Re: recovering from a controller failure, Richard, (Sat May 29, 2:38 pm)
Re: recovering from a controller failure, Berkey B Walker, (Sat May 29, 2:43 pm)
Re: recovering from a controller failure, Kyler Laird, (Sat May 29, 2:45 pm)
Re: recovering from a controller failure, Richard, (Sat May 29, 2:50 pm)
Re: recovering from a controller failure, Richard, (Sat May 29, 2:59 pm)
Re: recovering from a controller failure, Kyler Laird, (Sat May 29, 5:15 pm)
Re: recovering from a controller failure, Richard, (Sat May 29, 5:28 pm)
Re: recovering from a controller failure, Richard, (Sat May 29, 5:54 pm)