Re: wd0 read timeouts - how to proceed?

From: Webcharge
Date: Friday, December 24, 2010 - 3:00 am

Must be the holiday season *sigh*.... my OpenBSD server is suddenly
giving the occasional read timeout on the /var slice of the main harddisk:

-------
wd0(pciide0:0:0): timeout
         type: ata
         c_bcount: 65536
         c_skip: 0
wd0g: device timeout reading fsbn 17002464 of 17002464-17002591 (wd0 bn 
67334928; cn 66800 tn 8 sn 24), retrying
wd0: soft error (corrected)
-------
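
As a sanity check on the message above, the numbers can be cross-checked with a little shell arithmetic. This is only a sketch: it assumes 512-byte sectors and that the "wd0 bn" value is the partition-relative fsbn plus the start offset of wd0g, which is worth comparing against the output of disklabel wd0. The variable names are made up for illustration.

```shell
#!/bin/sh
# Values copied from the kernel message above.
fsbn_first=17002464   # first block of the failed read, relative to wd0g
fsbn_last=17002591    # last block of the failed read, relative to wd0g
disk_bn=67334928      # the same first block, relative to the whole disk wd0
c_bcount=65536        # size of the failed transfer in bytes
sector_size=512       # assumption: 512-byte sectors

# Number of sectors covered by the reported fsbn range (inclusive).
sectors_in_range=$((fsbn_last - fsbn_first + 1))
# Number of sectors implied by the transfer size.
sectors_in_transfer=$((c_bcount / sector_size))
# Start of wd0g on the disk, if bn = fsbn + partition offset (assumption).
partition_offset=$((disk_bn - fsbn_first))

echo "sectors in reported range: $sectors_in_range"
echo "sectors in transfer:       $sectors_in_transfer"
echo "implied wd0g start sector: $partition_offset"
```

If the first two figures agree and the third matches the offset that disklabel reports for partition g, the message describes a single 64 KB read inside /var.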

Is this the actual disk or the controller/other hardware? Either way it 
needs a fix.

My problem is this is a live system that is not close by. I would very 
much prefer to 'fix' this remotely to buy some time to replace the 
machine completely.
I do have offsite backups of essential data but not a spare system in 
the rack at this very moment.
Not to mention I would like to avoid spending X-mas alone in the datacenter.

There is a second harddisk installed, with OpenBSD formatted slices, but
of different proportions. This (larger) disk is unused, so its data and
layout may be wiped, and it seems like a smart idea to at least copy the
data over to it.

Can I "just copy /var (wd0g) to /var2 (wd1i) and remount", or should I
proceed otherwise? Or would copying and remounting /var simply not work
on a live system?

Or, possibly, I could 'clone' the whole wd0 disk to wd1 and use that
instead of wd0?
As I understand it, that requires booting into single user mode [1]
and/or having identical disks [2]. Or is there another (remote-safe) way?

Any advice is highly appreciated!

Thanks, and happy holidays,

Matt

[1] http://unixsadm.blogspot.com/2007/08/cloning-disk-in-openbsd.html
[2] http://monkey.org/openbsd/archive/tech/0112/msg00079.html

From: Joachim Schipper
Date: Friday, December 24, 2010 - 4:07 am

If the system is quiet, you can try 'sync; sync; dd ...; fsck', but
something like 'tar cpf - | tar xpf -' is more likely to get you a
somewhat consistent view. Change /etc/fstab and reboot (you *can* try
mounting the new /var over the old one, but you'll want to play with
fstat -n to see which processes are still accessing the old /var.)

Of course, this isn't guaranteed to work. In particular, if something is
actually writing to /var, your view won't be consistent. Even more in
particular, don't try this with running databases.
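
A minimal sketch of the tar pipe mentioned above, wrapped in a small helper so the source and destination are explicit. The function name copy_tree and the mount point /mnt/newvar are assumptions made for illustration; nothing here makes the copy consistent if something is writing to /var while it runs.

```shell
#!/bin/sh
# copy_tree SRC DST: copy everything under SRC into the existing directory
# DST through a tar pipe. The "p" flag preserves permissions; run as root
# if file ownership should be preserved as well.
copy_tree() {
    src=$1
    dst=$2
    (cd "$src" && tar cpf - .) | (cd "$dst" && tar xpf -)
}

# Hypothetical usage on the server (mount point name is an assumption):
#   mount /dev/wd1i /mnt/newvar
#   sync
#   copy_tree /var /mnt/newvar
```

Using "." as the archive member keeps the paths relative, so the tree lands directly under the destination directory instead of under a nested copy of the source path.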

		Joachim

From: Vadim Zhukov
Date: Friday, December 24, 2010 - 5:50 am

POSIX pax(1) with -rw options should work slightly faster (and it's
already faster to type ;) ).
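
For comparison, a sketch of the same copy using pax in copy mode (-rw). The helper name copy_tree_pax is hypothetical, and the destination must be an absolute path here because the helper changes directory before invoking pax.

```shell
#!/bin/sh
# copy_tree_pax SRC DST: copy everything under SRC into the existing
# directory DST using pax copy mode (-r and -w together).
# -pe asks pax to preserve "everything" (owner, mode, times) where it can.
# DST must be an absolute path because of the cd into SRC.
copy_tree_pax() {
    src=$1
    dst=$2
    (cd "$src" && pax -rw -pe . "$dst")
}

# Hypothetical usage (mount point name is an assumption):
#   copy_tree_pax /var /mnt/newvar
```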

--
  WBR,
  Vadim Zhukov

From: Chris Smith
Date: Friday, December 24, 2010 - 9:09 am

If the hardware is SMART-aware, installing smartmontools and running
smartctl may give you a clue.
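
A sketch of what that could look like once smartmontools is installed. The helper name smart_report is made up, and the device path /dev/wd0c (the raw whole-disk partition) is an assumption about this particular system.

```shell
#!/bin/sh
# smart_report DEV: print the drive's overall SMART health verdict,
# followed by the full SMART information (attributes, error log).
smart_report() {
    dev=$1
    smartctl -H "$dev"   # overall health self-assessment
    smartctl -a "$dev"   # all SMART information for the device
}

# Hypothetical usage, as root:
#   smart_report /dev/wd0c
```

Rising reallocated or pending sector counts would point at the disk itself, while a clean report makes the cable or controller more suspect, though neither outcome is conclusive.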

From: Gabriel Linder
Date: Friday, December 24, 2010 - 9:20 am

atactl(8) works just fine.
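
A sketch of the equivalent check with atactl(8) from the base system, so nothing needs to be installed. The helper name ata_smart_report is hypothetical, and wd0 is assumed to be the failing disk.

```shell
#!/bin/sh
# ata_smart_report DEV: enable SMART on the drive, ask for its health
# status, then dump the attribute table.
ata_smart_report() {
    dev=$1
    atactl "$dev" smartenable &&
    atactl "$dev" smartstatus &&
    atactl "$dev" readattr
}

# Hypothetical usage, as root:
#   ata_smart_report wd0
```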
