Let me follow up to share what I have learned and what I did to get
this array to reassemble.
I've received several responses from people telling me that they don't
have any problem with their "desktop class" drives being dropped from
the array. Congratulations to you all. I suspect there is a common
thread among the drives you are using: they may handle error
correction differently, may be smaller than 500GB, or may not support
the SCT command set.
One of the first responses I received privately was from a gentleman
who gave me the hint I needed regarding the SCT-ERC command. He
shared my frustration and presented a very compelling example of
where this is a big problem. He supports a commercial NAS product
which uses "desktop" class drives and fights this problem
continually.
With this new knowledge I started digging a bit more and ran
across a set of patches to smartmontools that allow editing the
values for SCT-ERC. You can find that source here:
http://www.csc.liv.ac.uk/~greg/projects/erc/
FWIW, the Seagate Barracudas that I am running have non-volatile
storage for this variable. Not that I am recommending Seagate. Far
from it....
I can confirm that all of my drives had this value "disabled", which
means the drive is allowed to go off and take as much time as it needs
to recover from an error on its own.
I set the values to 7 seconds for the 4 drives in my array and
attempted to rebuild the array. Unfortunately, it failed again. So I
reset the values to 5 seconds and fired off the rebuild once again and
managed to get through the rebuild process.
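
For anyone who wants to script the same change, here is a minimal
sketch in Python. It assumes a smartctl build that accepts
"-l scterc,READ,WRITE" (current smartmontools releases do; at the time
of writing you need the patched source above) and that the values are
given in tenths of a second, so 7 seconds is 70 and 5 seconds is 50.
The device names and the helper names are mine, purely for
illustration.

```python
import subprocess


def scterc_command(device, seconds):
    """Build a smartctl command setting read and write ERC to `seconds`.

    Assumption: smartctl takes the timeouts in units of 100 ms.
    """
    deciseconds = int(round(seconds * 10))
    if deciseconds <= 0:
        raise ValueError("timeout must be positive")
    return ["smartctl", "-l", f"scterc,{deciseconds},{deciseconds}", device]


def set_erc(devices, seconds, dry_run=True):
    """Apply the same ERC timeout to every device in the list.

    With dry_run=True the commands are only printed, not executed.
    """
    for dev in devices:
        cmd = scterc_command(dev, seconds)
        if dry_run:
            print(" ".join(cmd))
        else:
            # Needs root; raises CalledProcessError if smartctl fails.
            subprocess.run(cmd, check=True)
```

Calling set_erc(["/dev/sda", "/dev/sdb"], 5) only prints the commands;
pass dry_run=False (as root) to actually apply them. The device paths
here are placeholders, not the ones from my array.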
Now this solution does not cover the case where you are
hot-plugging drives, but it at least gets me over my hurdle.
Seems it would be a nice improvement to md to actually detect the
SCT-ERC setting, warn when it cannot change the value and offer to set
these to reasonable values for the RAID application.
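
To make the idea concrete, here is a rough user-space sketch of the
kind of check I have in mind. It is not md code and not an existing
tool. It assumes the report printed by "smartctl -l scterc" contains
lines such as "Read: 70 (7.0 seconds)" or "Read: Disabled"; treat that
format as an assumption and adjust the pattern to whatever your build
prints. The function names are hypothetical.

```python
import re

# Assumed line format: "Read:  70 (7.0 seconds)" or "Read:  Disabled"
_ERC_LINE = re.compile(r"^\s*(Read|Write):\s+(Disabled|\d+)", re.MULTILINE)


def parse_scterc(report):
    """Return {'read': int or None, 'write': int or None} in 100 ms units.

    None means the timer is disabled. An empty dict means no ERC lines
    were found, e.g. the drive does not support SCT ERC.
    """
    result = {}
    for name, value in _ERC_LINE.findall(report):
        result[name.lower()] = None if value == "Disabled" else int(value)
    return result


def erc_warning(device, report):
    """Return a warning string for a RAID member, or None if it looks fine."""
    erc = parse_scterc(report)
    if "read" not in erc or "write" not in erc:
        return f"{device}: could not read SCT-ERC settings"
    if erc["read"] is None or erc["write"] is None:
        return f"{device}: SCT-ERC is disabled"
    return None
```

Feeding it the captured output of "smartctl -l scterc /dev/sdX" for
each member would give one warning line per drive that needs
attention.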
Here's to happy storage...
On Wed, Mar 17, 2010 at 7:48 AM, Randy Terbush <randy@terbush.org> wrote: