Been trying to build a replacement HD for a system, .. and it seems impossible to verify whether a disk is bad or not (having wasted some hours rsync'ing data only to have the HD lock up the system when doing the final rsync). What is the best way to do a surface analysis on a disk? badsect seems like a holdover from MB-sized disks, and it doesn't do any analysis. TIA, Lee
The best way is to get a new disk. I'm serious. Disks are cheap enough, and the value of what's on them is high enough, that if you think it's going, get a new one. Even if this is a hobby system, I'd do that. There is disk testing software from the OEMs you can use. But if you think it's acting weird, don't trust it. --STeve Andre'
And I'm serious too - how many hard drives do you throw away before you
That's why I'm looking for a way to gather some hard data.
Lee
First thing I do with a new hard drive is run a long self-test using smartctl. If it passes it gets added to the system. I have smartd set to do a daily short self-test and a weekly long self-test on every drive. Replace any drives that start to show errors. Saludos, Jose.
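For anyone following along, the routine described above could be scripted roughly as below. This is only a sketch, not Jose's actual setup: the function names and the SMARTCTL override are made up for illustration, /dev/sd1c is a placeholder device name, and the smartd.conf line in the comments follows the schedule syntax from the smartd.conf man page rather than anything posted in this thread.

```shell
# SMARTCTL can be overridden; the default assumes smartctl is on PATH.
SMARTCTL=${SMARTCTL:-smartctl}

# start_long_test DEV: ask the drive to begin a long (extended) self-test.
# smartctl returns right away; the drive runs the test in the background.
start_long_test() {
    "$SMARTCTL" -t long "$1"
}

# show_selftest_log DEV: print the drive's self-test log, which records
# how each past self-test ended.
show_selftest_log() {
    "$SMARTCTL" -l selftest "$1"
}

# Example (needs root; the device name is a placeholder):
#   start_long_test /dev/sd1c
#   ... wait for the estimated duration that smartctl prints ...
#   show_selftest_log /dev/sd1c
#
# A smartd.conf entry for a daily short test (02:00) and a weekly long
# test (Saturday 03:00) might look like this:
#   /dev/sd1c -a -s (S/../.././02|L/../../6/03)
```

The log is the part to check before trusting a new drive: a test that was started is not the same as a test that passed.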
The self-tests take the drive offline while they run, right? Do you unmount them first, or is the system okay just waiting until the drive responds?
Huh. That kind of contradicts the name "offline self test", but I guess they call that "captive".
I have a pile of disks that I suspect. Looking at the drawer, I see 8 of them. As I have time I test them, usually with dd: dd if=/dev/sd1c of=/dev/null bs=64k and try that a bunch. Usually I shake loose a few errors after a while. And of course, listen to them. Hearing many clicking noises for recalibrates in a row is another sign of impending death. For laptop disks I've used IBM/Hitachi's Drive Fitness Test, and that usually works well, but earlier this year it gave a drive a clean bill of health, and the disk died about 2 weeks later. --STeve Andre'
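The "try that a bunch" part lends itself to a small loop. The following is a sketch only: read_passes is a hypothetical helper name, and the target is whatever you pass in (a raw device on a real system, any readable file for a dry run). It relies on dd exiting non-zero when a read fails.

```shell
# read_passes TARGET [PASSES]: read TARGET end to end PASSES times with dd
# and report how many of those passes hit a read error.
read_passes() {
    target=$1
    passes=${2:-3}
    failed=0
    i=1
    while [ "$i" -le "$passes" ]; do
        # dd exits non-zero if a read fails, e.g. on an unreadable sector
        if ! dd if="$target" of=/dev/null bs=64k 2>/dev/null; then
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    echo "$failed of $passes passes had read errors"
    # succeed only if every pass was clean
    [ "$failed" -eq 0 ]
}
```

Something like `read_passes /dev/rsd1c 5` would then do five full reads. Note that dd's own error messages are discarded here, so kernel messages on the console or in dmesg are still the place to look for details.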
Because you use block devices for mounting, and raw devices for
anything else. In some cases a block device might work for a
particular purpose, but why take chances?
-Otto
While this rule is probably right, a few tools, such as atactl(8) or smartctl from ports say something else in their respective man pages. Well, these two utilities are not handling the block themselves, but anyway. Is this a bug in the man pages? Regards, David
There is, in the e2fsprogs package, something called badblocks. I have used it (on Linux) to "rescue" bad disks. (Windows laptops -- kinda redundant?) If you care about your data, follow Steve's advice. The reality seems to be that this does exercise a disk's ability to relocate bad sectors, so that a bad disk suddenly goes good. This is using a destructive surface test (badblocks -sw ...). Realistically, it seems like the most reliable sign of trouble is that a disk is slower than it should be. Me, if I want to rely on a disk drive, I will run badblocks on it: the long-winded destructive test. And I will time it, at least sporadically. (New disks are not immune from having problems ;-) The exercise maybe loses out to watching grass grow.
Interesting, .. it DNB on 4.0, however, .. and I'm unsure as to any issues
Right. How many disks should I throw away before trying to gather some
Sounds like the best idea - do you run it from a Linux CD, or ??
Thanks!
Lee
Perhaps I didn't word my thoughts well enough, and appeared snarky to you? That wasn't my intent. Disks today are 1) VASTLY cheaper per meg of storage, 2) faster, 3) less power-consumptive and noisy. But there is also 4), which is that they aren't built as well. The MTBF figures are a mathematical fantasy, and dangerously worthless. I have many older systems running "small" disks from 2G to about 20G that are still fine since 1996. In fact, looking at my log of disk disasters, I've had three disks blow up when being used by my users, when they were using those machines. In contrast, the 60G+ disk era has given me at least 12 problems in the last four to five years, and I'm not counting friends' systems that I've helped out on. Probably more like 18+ disasters if I count those. Because of this I've adopted a really careful attitude about disks in general. I'm now starting to treat them like airplane parts--replace them before they fail. This is especially true for laptop disks (I've had four disks start to go on various OpenBSD thinkpads I've had). When you have free time you can beat on a disk, and take weeks pounding on it. Look at iogen in the ports tree as another testing method. It is also the case that multiple make builds of userland are a good test. I'm hesitant to depend on the SMART tools, because I've had laptop disks that failed hours after a check said things were fine, and I still have a 100G disk that generates SMART errors but which is absolutely good. Remember too that getting a disk replacement under warranty almost always results in a "recertified" disk, and I'm nervous about using them. Given the cost I get new ones. Hannah's comment that I should have used the raw device was quite correct; that was a tyop so it should have said --STeve Andre'
I also would recommend badblocks(8), but I would recommend badblocks -svn instead of badblocks -sw. badblocks -svn also (s)hows its progress as it goes along, but does a (v)erbose (n)on-destructive read/write test (as opposed to either the default read-only test or the destructive read/write test). You can check an entire device with badblocks, or a partition, or a file. The great thing about using badblocks to check a partition is that it's filesystem-agnostic. It will dutifully check every bit of its target partition regardless of what's actually on it. And if you give badblocks -svn an entire storage device to test, it will not even care about the actual partition scheme used. Because this read/write test can trigger the disk's own built-in bad sector relocation, this means you can even have a disk that you can't read the partition table from, and running badblocks -svn over it may at least temporarily fix things. And I've used badblocks -svn e.g. to check old Macintosh floppies. Who cares that OpenBSD doesn't know much about the filesystem on those? badblocks does the job anyway. (Because of this agnosticism, it's actually questionable whether badblocks(8) ought to be part of a filesystem-specific package, but hey, that's what it comes in. Yea, one *could* also argue whether to include it elsewhere by default because it's so useful, but I'm not the one making those decisions and I guess the folks who do will do what makes the most sense to them, so I don't feel like starting to be a back seat driver... ;-) Oh, and of course it would probably be prudent to do a backup before read/write tests, even though badblocks is well-established and (with -n) supposed to be non-destructive. Supposed to... ;-) I've never been disappointed but YMMV. regards, --ropers
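To make the non-destructive idea concrete, here is a rough sketch of the save / overwrite / verify / restore cycle for a single block, built from dd and cmp. It is an illustration of the principle only, not how badblocks itself is implemented: nd_test_block is a made-up name, random data stands in for the test pattern, and the 4096-byte default block size is an arbitrary choice.

```shell
# nd_test_block TARGET BLOCK [BS]: write/read check of one block that
# puts the original contents back afterwards.
nd_test_block() {
    target=$1; blk=$2; bs=${3:-4096}
    orig=$(mktemp) && pat=$(mktemp) && back=$(mktemp) || return 2
    rc=0
    # 1. save the current contents; give up before writing if that fails
    if ! dd if="$target" of="$orig" bs="$bs" skip="$blk" count=1 2>/dev/null; then
        rm -f "$orig" "$pat" "$back"
        return 1
    fi
    # 2. overwrite the block with a random test pattern
    dd if=/dev/urandom of="$pat" bs="$bs" count=1 2>/dev/null
    dd if="$pat" of="$target" bs="$bs" seek="$blk" count=1 conv=notrunc 2>/dev/null || rc=1
    # 3. read it back and compare with what was written
    dd if="$target" of="$back" bs="$bs" skip="$blk" count=1 2>/dev/null || rc=1
    cmp -s "$pat" "$back" || rc=1
    # 4. restore the saved contents
    dd if="$orig" of="$target" bs="$bs" seek="$blk" count=1 conv=notrunc 2>/dev/null || rc=1
    rm -f "$orig" "$pat" "$back"
    return "$rc"
}
```

If the machine dies between steps 2 and 4 the original block is gone, which is one more reason for the backup mentioned above. On a real disk the read in step 3 may also be answered from a cache rather than the platter, so treat a pass as weak evidence.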
You people crack me up. I have been trying to ignore this post for a while but can't anymore. Garbage like badblocks is from the era when you could still low-level format a drive. Remember those fun days? When you were all excited about your 10MB hard disk? Use dd to read it; if a sector is somewhat broken the drive will reallocate it. If it is badly broken the IO will fail and it is time to toss the disk. Those are about all the flavors you have available. Running vendor diags is basically a fancier dd.
Why do you consider badblocks garbage? I remember now that we talked about this before over a year ago, when I first asked about using badblocks on OpenBSD. Back then I eventually surmised that using dd to do the same thing as badblocks -svn would be possible but a lot more cumbersome, cf.: http://kerneltrap.org/mailarchive/openbsd-misc/2008/4/19/1499524 Am I/was I mistaken, and if so, where? Thanks and regards, --ropers
OK, I'll take a nibble. (flames invited where I've got anything wrong) You use OpenBSD where sloppy doesn't quite do what you need to be done. This is a world where a false sense of security is not your friend. "This disk is good because it passed badblocks" is NOT valid. I've got too many "rescued" disks that will probably keep on working. probably: better than 50%. (but it sounds good) depending on lots of probables is really instant death. IF badblocks passed a disk as clean, and there were good reason to believe that that disk was actually clean, and that it would STAY clean, then it (badblocks) would be a good program. Unfortunately, there is not much of anything that badblocks, or the vendors' programs, CAN do that is much of an assurance of reliability. You might get some idea from the reliability of "reconditioned" drives versus the reliability of actually new drives. And the vendors have better tools (if such better tools actually exist). WITHOUT going into HW or OS handling of bad sectors, simply rename files or directories something like BAD_STUFF and NEVER delete 'em. There are exotic ways of increasing risk by keeping the most of the not-failed-yet neighbors as supposedly good sectors. You can do much of that by partitioning to avoid places with a lot of bad stuff. With the prices and capacities of modern disks, all of this must assume that you have lots of time and need something to occupy that time. Watching grass grow is probably more exciting. For a new disk (one that does not need to go into production soon) you can run a very long-winded exercise. Zeroing and reading is probably as effective as, and certainly faster than, 0xAA 0x55 0xFF 0x00. There SHOULD be good data forthcoming from the SMART stuff. BUT, so far I haven't heard noises from that corner, just wisecracks about vendor diags. Presumably, SHOULD does not imply IS. IF you have anything resembling money, and do not have lots of free time on your hands, the best advice seems to be to ...
Not with a modern disk. The drives now essentially lie about where on the disk any given block is, you'll never know if block N is anywhere (physically) near block N-1 or N+1. Starting about 15 years ago, the most reasonable check I could find was the 'verify' command in Solaris' 'format' command (which I've yet to find/write a simple alternative to). Anything else is just a waste of time. What this did was basically write a block of random bits, then read and compare. You need to do both, because some blocks are readable, but not writable, and vice versa. If you get a mismatch, the block was unreadable, and was (hopefully) remapped, so try again. The OS usually logs read and write errors (soft and/or hard) and you'd have some idea of the relative 'health' of the disk. Frankly, we would verify a disk if we hit a bad block, and if that remapped the bad block and produced no other errors over two passes, we'd keep using it (disks weren't that cheap then). If we got another error, we'd replace the disk. We got so many new disks that would encounter a bad block (and the OS would log the error) that we started verifying the disk when we got them to map out any bad blocks. . . Sean
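The write-random-then-read-and-compare pass described above can be approximated with stock tools. What follows is a minimal sketch under stated assumptions, not a reimplementation of the Solaris 'verify' command: verify_pass is a hypothetical name, the caller supplies the block count and block size, and the pass is DESTRUCTIVE, since every block it touches ends up holding random data.

```shell
# verify_pass TARGET NBLOCKS [BS]: destructive write/read/compare pass.
# For each block: write random data, read it back, compare the two.
verify_pass() {
    target=$1; nblocks=$2; bs=${3:-4096}
    pat=$(mktemp) || return 2
    back=$(mktemp) || { rm -f "$pat"; return 2; }
    bad=0
    blk=0
    while [ "$blk" -lt "$nblocks" ]; do
        ok=1
        dd if=/dev/urandom of="$pat" bs="$bs" count=1 2>/dev/null
        # the write and the read are checked separately, since a block
        # can fail on one and not the other
        dd if="$pat" of="$target" bs="$bs" seek="$blk" count=1 conv=notrunc 2>/dev/null || ok=0
        dd if="$target" of="$back" bs="$bs" skip="$blk" count=1 2>/dev/null || ok=0
        cmp -s "$pat" "$back" || ok=0
        if [ "$ok" -eq 0 ]; then
            echo "block $blk: write/read mismatch or I/O error"
            bad=$((bad + 1))
        fi
        blk=$((blk + 1))
    done
    rm -f "$pat" "$back"
    echo "$bad bad of $nblocks blocks"
    [ "$bad" -eq 0 ]
}
```

Running it twice and comparing the reports mirrors the two-pass rule above: a block that fails once and then passes was possibly remapped, while new failures on the second pass are the signal to replace the disk. Starting three dd processes per block is slow, so this is for illustrating the logic rather than for testing a whole modern disk, and a read-back may be served from a cache instead of the medium.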
MHDD might do what you want: http://hddguru.com/content/en/software/2005.10.02-MHDD/ I haven't used it, but Victoria (http://hdd-911.com/) might be useful if you can read Russian. Gibson's Spinrite is okay to check a drive but he tries to imply that what he does is way more complicated than it really is. That, and the author is a weenie media whore. I rarely see a bad drive lock up the system on modern machines without timeout messages on the console, etc. Your controller or cable may be suspect if the drive passes all the tests you throw at it.
Some good options, .. seems like all are DOS, however <g>!! I guess that's
no big deal if you're rebooting for the analysis, but it does not seem 'right'!
Lee
No, they have a Windows version of Victoria! <g> Personally, I use these kinds of utilities to see if a drive is worth saving, when I can do destructive tests. For example I "recovered" a 250GB disk from an XServe RAID that I use as a second drive in my work desktop. SMART reports 300 reallocation events, but no matter what I do that doesn't increase. I use it for temporary storage for easy-to-replace data.
I once wrote a fancy dd to recover a disk that jordan used for pictures. It worked well enough to get the crap off before the disk died totally. Anyway I dusted it off and added a man page and stuff. Have a look at http://www.peereboom.us/diskrescue/ if you want to play. I'll add some more language to the man page when I get time.
