> On Mon, Jun 02, 2008 at 04:04:12PM +1000, Andrew Hill wrote:
> > On Mon, May 19, 2008 at 1:11 AM, Andrew Hill <lists@thefrog.net> wrote:
> > > i tend to find that the timeouts occur on one or two disks at once -
> > > e.g. ad0 and ad2 will complain of timeouts, and the system locks up
> > > shortly thereafter...
> >
> > after spitting out the usual errors from ad0 and ad2 (in this case) with
> > TIMEOUTs and subsequent FAILUREs on READ_DMA[48] and WRITE_DMA[48]...
> >
> > i got the following panic
> >
> > vm_fault: pager read error, pid 1552 (tlsmgr)
> > ad0: FAILURE - READ_DMA48 timed out LBA=352903900
> > swap_pager: indefinite wait buffer: bufobj: 0, blkno: 437, size: 4096
> > ad2: FAILURE - WRITE_DMA timed out LBA=239717693
> > panic: ZFS: I/O failure (write on <unknown> off 0: zio 0xffffff001d47c810
> > [L0 ZIL intent log] b000L/b000P DVA[0]=<0:c807795000:d000> zilog
> > uncompressed LE contiguous birth=750230 fill=0
> > cksum=69f76525a84e1816:f6d86fe1d94cd68c:39:8af): error 5
> > KDB: enter: panic
> > [thread pid 72 tid 100071 ]
> > Stopped at kdb_enter_why+0x3d: movq $0,0x39b248(%rip)
> > db>
>
> I would say the ZFS crash is a result of the ad0/ad2 timeouts. The panic
> shows a hard I/O failure (error 5, i.e. EIO) on a ZIL write, which
> indicates a serious problem -- very likely hardware-related (or rather,
> at a lower level than ZFS).
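
One way to see which disks and commands are involved is to tally the FAILURE lines per device. The following is only a minimal Python sketch: it assumes the console messages have been saved as text and look like the lines quoted above, and the name tally_failures is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "ad0: FAILURE - READ_DMA48 timed out LBA=352903900".
# The pattern is an assumption based only on the messages quoted in this thread.
FAILURE_RE = re.compile(
    r"(?P<dev>ad\d+): FAILURE - (?P<cmd>\w+) timed out LBA=(?P<lba>\d+)"
)


def tally_failures(text):
    """Count timed-out commands per (device, command) pair."""
    counts = Counter()
    for line in text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("cmd"))] += 1
    return counts


if __name__ == "__main__":
    sample = """\
vm_fault: pager read error, pid 1552 (tlsmgr)
ad0: FAILURE - READ_DMA48 timed out LBA=352903900
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 437, size: 4096
ad2: FAILURE - WRITE_DMA timed out LBA=239717693
"""
    for (dev, cmd), n in sorted(tally_failures(sample).items()):
        print(dev, cmd, n)
```

If the same one or two devices dominate the tally across several lockups, that at least narrows down which disks, cables, or controller ports to look at first.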
>
> You've read this already, but maybe you missed the DMA error part:
>
> http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues
>
> The DMA errors can actually be legitimate too -- it's very hard to tell
> whether they're spurious (e.g. a FreeBSD bug) or real. If the problem
> is reproducible, that makes it much easier to get you additional help.
>
> I really need to sit down and write a huge HOWTO doc for people on how
> to diagnose whether or not their disks or cables are bad, etc... It's a
> very hard thing to document, because everyone's situation is different.
>
> The first piece to start with is simplest, though: install
> ports/sysutils/smartmontools and provide the output of "smartctl -a
> /dev/ad0" and /dev/ad2. Actual disk errors will very likely show up
> there in one of the counters, or in the SMART log. I'd personally like
> to see the output from smartctl, because it's something you can do while
> the system is up/working.
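
The smartctl step could be scripted roughly as follows. This is a sketch under assumptions, not a definitive procedure: the helper names and the output file names are invented here, smartctl must be on the PATH, and it normally needs to run as root.

```python
import subprocess


def smartctl_command(device):
    """Build the argument list for 'smartctl -a /dev/<device>'."""
    return ["smartctl", "-a", "/dev/" + device]


def collect(devices):
    """Run smartctl for each device and save its output to <device>.smart.txt."""
    for dev in devices:
        try:
            result = subprocess.run(
                smartctl_command(dev), capture_output=True, text=True
            )
        except FileNotFoundError:
            print("smartctl not found; install sysutils/smartmontools first")
            return
        # smartctl's exit status is a bitmask and can be non-zero even when
        # a full report was printed, so the output is kept regardless.
        with open(dev + ".smart.txt", "w") as out:
            out.write(result.stdout)
        print(dev, "exit status", result.returncode)
```

Calling collect(["ad0", "ad2"]) would leave ad0.smart.txt and ad2.smart.txt in the current directory, ready to be attached to a reply.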
>
> The next step would involve replacing your cables. If the problem
> continues, you've at least removed one piece of the puzzle.
>
> Next, replace the disks -- especially if they were bought at the same
> time, and are from the same vendor. Hard disk vendors are known to have
> bad batches of disks. For example, I just had two Western
> Digital disks (which I bought at the same time) fail a short I/O test,
> returning errors at different LBAs (blocks). The 2nd one only started
> showing problems a few weeks after the first. I obviously got both of
> them RMA'd.
>
> Finally, replace the controller or motherboard. Some people have
> reported success with this.
>
> > generally the lockups don't result in a panic (at least not in the short
> > term of 5-10 minutes), so i can't be sure that this panic is necessarily
> > caused by the same problem, but thought it might be worth posting in case
> > it gives an indication of the location/cause of the deadlock
>
> The DMA timeout errors you've seen, others have seen as well --
> including me -- even when the hardware, disks, cabling, and controllers
> are in a 100% working state. (Even switching OSes results in no errors,
> indicating there is a problem with FreeBSD in some way.)
>
> If the problem is reproducible, you should get in contact with Scott
> Long and let him poke at things. (I mentioned this last time. :-) )
> I myself am not familiar with the FreeBSD kernel, the device drivers, or
> working with the kernel at such a low level to debug things of this
> nature.
>
> > unfortunately i couldn't get a backtrace or core dump for 'political'
> > reasons (the system was required for use by others) but i'll see if i can
> > get a panic happening after-hours to get some more info...
>
> I can't tell you what to do or how to do your job, but honestly you
> should be pulling this system out of production and replacing it with a
> different one, or a different implementation, or a different OS. Your
> users/employees are probably getting ticked off at the crashes, and it
> probably irritates you too. The added benefit is that you could get
> Scott access to the box.