Re: HD 'Analysis'

From: L. V. Lammert
Subject: HD 'Analysis'
Date: Monday, May 4, 2009 - 2:56 pm

Been trying to build a replacement HD for a system, .. and it seems 
impossible to verify whether a disk is bad or not (having wasted some hours 
rsync'ing data only to have the HD lock up the system when doing the final 
rsync).

What is the best way to do a surface analysis on a disk? badsect seems like 
a holdover from MB-sized disks, and it doesn't do any analysis.

	TIA,

	Lee

From: STeve Andre'
Date: Monday, May 4, 2009 - 3:06 pm

The best way is to get a new disk.  I'm serious.  Disks are cheap enough, and
the value of what's on them is high enough that if you think it's going, get a
new one.  Even if this is a hobby system, I'd do that.

There is disk testing software from the OEMs you can use.

But if you think it's acting weird, don't trust it.

--STeve Andre'

From: L. V. Lammert
Date: Monday, May 4, 2009 - 3:29 pm

And I'm serious too - how many hard drives do you throw away before you 

That's why I'm looking for a way to gather some hard data.

         Lee

From: Jose Quinteiro
Date: Monday, May 4, 2009 - 3:36 pm

From: L. V. Lammert
Date: Tuesday, May 5, 2009 - 9:35 am

Thanks! I have used smart tools in the past, .. but how do you use them for 
testing?

         Lee

From: José Quinteiro
Date: Tuesday, May 5, 2009 - 9:50 am

First thing I do with a new hard drive is run a long self-test using 
smartctl.  If it passes it gets added to the system.  I have smartd set 
to do a daily short self-test and a weekly long self-test on every 
drive.  Replace any drives that start to show errors.
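A minimal sketch of the routine described above, assuming smartmontools is installed from ports and the drive is /dev/sd0c (a hypothetical device name; adjust for your system). The function names are made up for illustration. The smartctl flags and the smartd.conf schedule syntax are the ones documented in the smartmontools man pages.

```sh
#!/bin/sh
# Hypothetical helper names; /dev/sd0c in the usage note is an assumed device.

# Start a long (extended) self-test. smartctl returns at once and the
# drive runs the test on its own in the background.
start_long_test() { smartctl -t long "$1"; }

# Show the self-test log once the test has had time to finish.
show_test_log() { smartctl -l selftest "$1"; }

# Print a smartd.conf line for one device: monitor everything (-a),
# short self-test every day in the 02:00 hour, long self-test every
# Saturday in the 03:00 hour.
smartd_conf_line() {
    printf '%s -a -s (S/../.././02|L/../../6/03)\n' "$1"
}
```

For a new drive that would be start_long_test /dev/sd0c, then show_test_log /dev/sd0c some hours later. The output of smartd_conf_line /dev/sd0c can be appended to smartd.conf to get the daily and weekly schedule.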

Saludos,
Jose.


From: Steve Shockley
Date: Wednesday, May 6, 2009 - 3:45 am

The self-tests take the drive offline while they run, right?  Do you 
unmount them first, or is the system okay just waiting until the drive 
responds?

From: Martin Schröder
Date: Wednesday, May 6, 2009 - 8:24 am

No. man smartctl

Best
   Martin

From: Steve Shockley
Date: Wednesday, May 6, 2009 - 7:03 pm

Huh.  That kind of contradicts the name "offline self test", but I guess 
they call that "captive".

From: STeve Andre'
Date: Monday, May 4, 2009 - 3:34 pm

I have a pile of disks that I suspect.  Looking at the drawer, I see 8
of them.  As I have time I test them, usually with dd:

   dd if=/dev/sd1c of=/dev/null bs=64k

and try that a bunch.  Usually I shake loose a few errors after a 
while.  And of course, listen to them.  Hearing many clicking
noises for recalibrates in a row is another sign of impending
death.
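A minimal sketch of "try that a bunch" as a loop. read_scan is a hypothetical helper name, and the default of three passes is an arbitrary assumption. It only reads, so it is safe for the data on the disk.

```sh
#!/bin/sh
# read_scan DEVICE [PASSES]: read the whole device PASSES times with dd,
# discarding the data. dd exits non-zero on an I/O error, which stops
# the loop and makes the function return 1.
read_scan() {
    dev=$1
    passes=${2:-3}
    i=1
    while [ "$i" -le "$passes" ]; do
        if ! dd if="$dev" of=/dev/null bs=64k 2>/dev/null; then
            echo "pass $i: read error on $dev"
            return 1
        fi
        echo "pass $i: ok"
        i=$((i + 1))
    done
}
```

Usage would be something like read_scan /dev/rsd1c 5, run as root. The raw device name is an example only.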

For laptop disks I've used IBM/Hitachi's drive fitness test, and
that usually works well, but earlier this year it gave a drive a
clean bill of health, and the disk died about 2 weeks later.

--STeve Andre'

From: Hannah Schroeter
Date: Tuesday, May 5, 2009 - 4:46 am

Hi!

>   dd if=/dev/sd1c of=/dev/null bs=64k
               ^r


Kind regards,

Hannah.

From: Tobias Walkowiak
Date: Monday, May 18, 2009 - 6:21 am

From: Otto Moerbeek
Date: Monday, May 18, 2009 - 6:28 am

Because you use block devices for mounting, and raw devices for
anything else. In some cases a block device might work for a
particular purpose, but why take chances?

        -Otto

From: David Vasek
Date: Monday, May 18, 2009 - 6:50 am

While this rule is probably right, a few tools, such as atactl(8) or 
smartctl from ports say something else in their respective man pages. 
Well, these two utilities are not handling the block themselves, but 
anyway. Is this a bug in the man pages?

Regards,
David

From: Hannah Schroeter
Date: Monday, May 18, 2009 - 6:37 am

Hi!


If nothing else, it'll be much faster.

Kind regards,

Hannah.

From: Tony Abernethy
Date: Monday, May 4, 2009 - 3:45 pm

There is, in the e2fsprogs package, something called badblocks.
I have used it (on Linux) to "rescue" bad disks.
(Windows laptops  -- kinda redundant?)

If you care about your data, follow Steve's advice.

The reality seems to be that this does exercise a disk's ability
to relocate bad sectors so that a bad disk suddenly goes good.
This is using a destructive surface test  (badblocks -sw ...)
Realistically, seems like the most reliable test is that disk is slower
than it should be.

Me, if I want to rely on a disk drive, I will run badblocks on it.
The long-winded destructive test
And I will time it, at least sporadically.
(New disks are not immune from having problems ;-)
The exercise maybe loses out to watching grass grow.

From: L. V. Lammert
Date: Tuesday, May 5, 2009 - 9:11 am

Interesting, .. it DNB on 4.0, however, .. and I'm unsure as to any issues 

Right. How many disks should I throw away before trying to gather some 

Sounds like the best idea - do you run it from a Linux CD, or ??

         Thanks!

         Lee

From: STeve Andre'
Date: Tuesday, May 5, 2009 - 10:09 am

Perhaps I didn't word my thoughts well enough, and appeared snarky
to you?  That wasn't my intent.

Disks today are 1) VASTLY cheaper per meg of storage, 2) Faster, 3)
less power consumptive and noisy.

But there is also 4) which is they aren't built as well.  The MTBF figures
are a mathematical fantasy, and dangerously worthless.  I have many
older systems running "small" disks from 2G to about 20G that are
still fine since 1996.  In fact, looking at my log of disk disasters, I've
had three disks blow up when being used by my users, when they
were using those machines.  In contrast, the 60G+ disk era has given
me at least 12 problems in the last four to five years, and I'm not
counting friends' systems that I've helped out on.  Probably more
like 18+ disasters if I count those.

Because of this I've adopted a really careful attitude about disks
in general.  I'm now starting to treat them like airplane parts--replace
them before they fail.  This is especially true for laptop disks (I've
had four disks start to go on various OpenBSD thinkpads I've had).

When you have free time you can beat on a disk, and take weeks
pounding on it.  Look at iogen in the ports tree as another testing
method.  It is also the case that multiple make builds of userland
is a good test.  I'm hesitant to depend on the smart tools, because
I've had laptop disks that failed hours after a check said things
were fine, and I still have a 100G disk that generates SMART errors
but which is absolutely good.

Remember too that getting a disk replacement under warranty
almost always results in a "recertified" disk, and I'm nervous about
using them.  Given the cost I get new ones.

Hannah's comment that I should have used the raw device was
quite correct; that was a tyop so it should have said

   dd if=/dev/rsd1c of=/dev/null bs=64k

--STeve Andre'

From: ropers
Date: Wednesday, May 6, 2009 - 4:10 pm

I also would recommend badblocks(8), but I would recommend
  badblocks -svn
instead of badblocks -sw.

badblocks -svn also (s)hows its progress as it goes along, but does a
(v)erbose (n)on-destructive read/write test (as opposed to either the
default read-only test or the destructive read/write test). You can
check an entire device with badblocks, or a partition, or a file. The
great thing about using badblocks to check a partition is that it's
filesystem-agnostic. It will dutifully check every bit of its target
partition regardless of what's actually on it. And if you give
badblocks -svn an entire storage device to test, it will not even care
about the actual partition scheme used. Because this read/write test
can trigger the disk's own built-in bad sector relocation, this means
you can even have a disk that you can't read the partition table from,
and running badblocks -svn over it may at least temporarily fix
things. And I've used badblocks -svn e.g. to check old Macintosh
floppies. Who cares that OpenBSD doesn't know much about the
filesystem on those? badblocks does the job anyway.
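A minimal sketch of the non-destructive scan described above, assuming badblocks from the e2fsprogs package is installed. scan_nondestructive is a hypothetical wrapper name, and the -o option (write the list of bad blocks to a file) is added here for convenience; it is not part of the command as given in the message.

```sh
#!/bin/sh
# scan_nondestructive TARGET LISTFILE: run badblocks in non-destructive
# read/write mode (-n), showing progress (-s) and verbose messages (-v).
# The numbers of any bad blocks found are written to LISTFILE, so an
# empty LISTFILE after the run means no bad blocks were reported.
# TARGET may be a whole device, a partition, or a plain file, and must
# not be mounted while the test runs.
scan_nondestructive() {
    badblocks -svn -o "$2" "$1"
}
```

For example, scan_nondestructive /dev/rsd1c /tmp/sd1.bad (device name and list file path are examples only), after taking the backup recommended in this message.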

(Because of this agnosticism, it's actually questionable whether
badblocks(8) ought to be part of a filesystem-specific package, but
hey, that's what it comes in. Yea, one *could* also argue whether to
include it elsewhere by default because it's so useful, but I'm not
the one making those decisions and I guess the folks who do will do
what makes the most sense to them, so I don't feel like starting to be
a back seat driver... ;-)

Oh, and of course it would probably be prudent to do a backup before
read/write tests, even though badblocks is well-established and (with
-n) supposed to be non-destructive. Supposed to... ;-) I've never been
disappointed but YMMV.

regards,
--ropers

From: Marco Peereboom
Date: Thursday, May 7, 2009 - 5:39 am

You people crack me up.  I have been trying to ignore this post for a
while but can't anymore.  Garbage like badblocks is from the era when
you could still low-level format a drive.  Remember those fun days?
When you were all excited about your 10MB hard disk?

Use dd to read it; if it is somewhat broken the drive will reallocate
it.  If it is badly broken the IO will fail and it is time to toss the
disk.  Those are about all the flavors you have available.  Running
vendor diags is basically a fancier dd.


From: ropers
Date: Thursday, May 7, 2009 - 1:19 pm

Why do you consider badblocks garbage?

I remember now that we talked about this before over a year ago, when
I first asked about using badblocks on OpenBSD. Back then I eventually
surmised that using dd to do the same thing as badblocks -svn would be
possible but a lot more cumbersome, cf.:
http://kerneltrap.org/mailarchive/openbsd-misc/2008/4/19/1499524

Am I/was I mistaken, and if so, where?

Thanks and regards,
--ropers

From: Tony Abernethy
Date: Thursday, May 7, 2009 - 4:50 pm

OK, I'll take a nibble. (flames invited where I've got anything wrong)

You use OpenBSD where sloppy doesn't quite do what you need to be done.
This is a world where a false sense of security is not your friend.
"This disk is good because it passed badblocks" is NOT valid.
I've got too many "rescued" disks that will probably keep on working.
probably: better than 50%. (but it sounds good)
depending on lots of probables is really instant death.

IF badblocks passed a disk as clean, and there were good reason to 
believe that that disk was actually clean, and that it would STAY
clean, then it (badblocks) would be a good program.
Unfortunately, there is not much of anything that badblocks, or the
vendors' programs CAN do that is much of an assurance of reliability.
You might get some idea from the reliability of "reconditioned" 
drives versus the reliability of actually new drives. And the vendors
have better tools (if such better tools actually exist).

WITHOUT going into HW or OS handling of bad sectors, simply rename
files or directories something like BAD_STUFF and NEVER delete 'em.
There are exotic ways of increasing risk by keeping the most of the
not-failed-yet neighbors as supposedly good sectors.
You can do much of that by partitioning to avoid places with a lot
of bad stuff. With the prices and capacities of modern disks, all
of this must assume that you have lots of time and need something to
occupy that time. Watching grass grow is probably more exciting.

For a new disk (one that does not need to go into production soon)
you can run a very long-winded exercise. Zeroing and reading is
probably as effective and certainly faster than 0xAA 0x55 0xFF 0x00

There SHOULD be good data forthcoming from the SMART stuff.
BUT, so far I haven't heard noises from that corner, just wisecracks
about vendor diags. Presumably, SHOULD does not imply IS.
IF you have anything resembling money, and do not have lots of
free time on your hands, the best advice seems to be to ...

From: Sean Kamath
Date: Thursday, May 7, 2009 - 10:12 pm

Not with a modern disk.  The drives now essentially lie about where on  
the disk any given block is, you'll never know if block N is anywhere  
(physically) near block N-1 or N+1.

Starting about 15 years ago, the most reasonable check I could find  
was the 'verify' command in solaris' 'format' command (which I've yet  
to find/write a simple alternative to).  Anything else is just a waste  
of time.

What this did was basically write a block of random bits, then read  
and compare.  You need to do both, because some blocks are readable,  
but not writable, and vice versa. If you get a mismatch, the block was  
unreadable, and was (hopefully) remapped, so try again.  The OS  
usually logs read and write errors (soft and/or hard) and you'd have  
some idea of the relative 'health' of the disk.
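A minimal sketch of that write, read back and compare step for a single block, using dd and cmp. It is not the Solaris format code. verify_block is a hypothetical name and the 64k default block size is an assumption. It OVERWRITES the block it tests, so it is only for disks with nothing on them.

```sh
#!/bin/sh
# verify_block TARGET BLOCKNO [BS]: DESTRUCTIVE. Write one block of
# random data at block BLOCKNO of TARGET, read it back and compare.
# Returns 0 if the data read back matches what was written, non-zero
# on a write error, a read error, or a mismatch.
# Caveat: the read may be served from a cache rather than the platter,
# so a match is weaker evidence than a mismatch.
verify_block() {
    target=$1
    blk=$2
    bs=${3:-65536}
    pat=$(mktemp) || return 2
    back=$(mktemp) || return 2
    dd if=/dev/urandom of="$pat" bs="$bs" count=1 2>/dev/null
    dd if="$pat" of="$target" bs="$bs" seek="$blk" count=1 conv=notrunc 2>/dev/null &&
        dd if="$target" of="$back" bs="$bs" skip="$blk" count=1 2>/dev/null &&
        cmp -s "$pat" "$back"
    rc=$?
    rm -f "$pat" "$back"
    return $rc
}
```

Looping BLOCKNO over the whole disk and counting failures gives a rough version of a verify pass; the OS log still has to be checked for the soft errors mentioned above.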

Frankly, we would verify a disk if we hit a bad block, and if that  
remapped the bad block and produced no other errors over two passes,  
we'd keep using it (disks weren't that cheap then).  If we got another  
error, we'd replace the disk.  We got so many new disks that would  
encounter a bad block (and the OS would log the error) that we started  
verifying the disk when we got them to map out any bad blocks. . .

Sean


From: Steve Shockley
Date: Monday, May 4, 2009 - 7:32 pm

MHDD might do what you want:

http://hddguru.com/content/en/software/2005.10.02-MHDD/

I haven't used it, but Victoria (http://hdd-911.com/) might be useful if 
you can read Russian.

Gibson's Spinrite is okay to check a drive but he tries to imply that 
what he does is way more complicated than it really is.  That, and the 
author is a weenie media whore.

I rarely see a bad drive lock up the system on modern machines without 
timeout messages on the console, etc.  Your controller or cable may be 
suspect if the drive passes all the tests you throw at it.

From: L. V. Lammert
Date: Tuesday, May 5, 2009 - 8:49 am

Some good options, .. seems like all are DOS, however <g>!! I guess that's 
no big deal if you're rebooting for the analysis, but it does not seem 'right'!

         Lee

From: Steve Shockley
Date: Wednesday, May 6, 2009 - 3:52 am

No, they have a Windows version of Victoria! <g>  Personally, I use 
these kinds of utilities to see if a drive is worth saving, when I can 
do destructive tests.  For example I "recovered" a 250GB disk from an 
XServe RAID that I use as a second drive in my work desktop.  SMART 
reports 300 reallocation events, but no matter what I do that doesn't 
increase.  I use it for temporary storage for easy-to-replace data.

From: Marco Peereboom
Date: Wednesday, May 13, 2009 - 3:42 pm

I once wrote a fancy dd to recover a disk that jordan used for pictures.
It worked well enough to get the crap off before the disk totally died.
Anyway I dusted it off and added a man page and stuff.  Have a look at
http://www.peereboom.us/diskrescue/
if you want to play.
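For comparison, a plain dd can do a cruder version of this kind of rescue. The sketch below is not the diskrescue tool and makes no claim about how that tool works. rescue_image is a hypothetical name and the paths in the usage note are examples.

```sh
#!/bin/sh
# rescue_image SOURCE IMAGE: copy SOURCE to IMAGE, continuing past read
# errors. conv=noerror keeps dd going after a failed read; conv=sync
# pads each short or failed input block with zeros so the data that
# follows stays at the right offset in the image.
# Side effect of sync: if SOURCE is not a multiple of 64k in size, the
# last block of IMAGE is zero-padded up to 64k.
rescue_image() {
    dd if="$1" of="$2" bs=64k conv=noerror,sync
}
```

For example, rescue_image /dev/rsd1c /backup/sd1.img, with the image on a different, healthy disk. A smaller block size loses less data around each bad spot but takes longer.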

I'll add some more language to the man page when I get time.
