Re: I/O errors on Hammer volume

Previous thread: HEADS UP - wireless driver work going into master, some drivers will stop working temporarily. by Matthew Dillon on Monday, April 12, 2010 - 10:03 am. (1 message)

Next thread: HEADS UP: I/O scheduler (dsched) now in master by Alex Hornung on Thursday, April 15, 2010 - 1:44 pm. (4 messages)
From: Francois Tigeot
Date: Monday, April 12, 2010 - 11:35 pm

I've just found this in the daily security output of my workstation
(running 2.6.1) :

Checking setuid files and devices:
find: /usr/pkgsrc/sysutils/sysinfo/CVS/Entries: Input/output error
find: /usr/pkgsrc/devel/avltree/CVS/Entries: Input/output error
find: /usr/pkgsrc/security/p5-Crypt-RSA/CVS/Entries: Input/output error
find: /usr/pkgsrc/ham/dpbox/CVS/Entries.Log: Input/output error
find: /usr/pkg/openoffice.org: Input/output error
find: /usr/pkg/openoffice.org3: Input/output error

These directories can no longer be used:

# cd /usr/pkg
# ls -ld o*
ls: openoffice.org: Input/output error
ls: openoffice.org3: Input/output error

I'm not sure if this is due to a bug in HAMMER or to some sort of hardware
failure. The hard disk is a 500GB Western Digital Green model with a single
hammer version 4 volume. The directories belong to the root fs.

Filesystem                    Size   Used  Avail Capacity  Mounted on
PFS500GP                      460G   400G    60G    87%    /

There's no relevant error message in dmesg.

Is there any tool I can use to diagnose and fix this problem ?

-- 
Francois Tigeot
From: Michael Neumann
Date: Tuesday, April 13, 2010 - 2:36 am

Check with sysutils/smartmontools. I'd love to have a simplistic SMART
monitoring daemon integrated into DragonFly which is run by default.
This would be so cool and useful!

Regards,

  Michael
From: Petr Janda
Date: Tuesday, April 13, 2010 - 3:39 am

I'd be all for that too. Actually, I'd be all for going one step further and 
write a program(or if one like that already exists, put it in base) that 
checks/monitors the health of the hardware. Bad hardware (drives, memory etc) 
are sneaky fuckers that tend to bite when one doesn't expect it. I dont know 
how one would go about making such software but going forward, a cluster of 
ie. 32 DF servers would be difficult to monitor for health and such software 
would just make it much easier. In my opinion.

Petr
From: Francois Tigeot
Date: Tuesday, April 13, 2010 - 3:54 am

Thanks, this could be a very nice utility but I'm afraid the output is bit too
cryptic for my tastes.

I have included the most relevant part of the output of
"smartctl -d sat -a /dev/da0" below:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always     -       0
  3 Spin_Up_Time            0x0027   157   148   021    Pre-fail  Always     -       3116
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always     -       135
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always     -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always     -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always     -       2218
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always     -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always     -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always     -       133
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always     -       131
193 Load_Cycle_Count        0x0032   147   147   000    Old_age   Always     -       160820
194 Temperature_Celsius     0x0022   109   093   000    Old_age   Always     -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always     -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always     -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline    -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always     -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline    -       0

-- 
Francois Tigeot
From: Matthew Dillon
Date: Tuesday, April 13, 2010 - 11:49 am

That's all normal except the Load_Cycle_Count is excessive.  But
    the Load_Cycle_Count is somewhat dependent on the manufacturer
    and I don't see any errors reported in the output.

    Another thing you can do is run 'hammer -f <device> show | less -B'
    and search for lines beginning with the character 'B'.  You would
    type slash-carrot-B into LESS.  The hammer show command checks
    for problems.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From: Francois Tigeot
Date: Tuesday, April 13, 2010 - 10:37 pm

There are many such lines. I have copied the first few ones:

BM   NODE 800000268ea27000 cnt=33 p=800000268ea12000 type=L depth=4 mirror
000000013a77e4d0 fill=z8:19741=60% {

BM   NODE 800000268ea31000 cnt=33 p=800000268ea12000 type=L depth=4 mirror
0000000139b6e140 fill=z8:19741=60% {

BM   NODE 800000268eb0e000 cnt=32 p=800000268eb02000 type=L depth=4 mirror
0000000148774380 fill=z8:19741=60% {

BM   NODE 800000268eb19000 cnt=32 p=800000268eb02000 type=L depth=4 mirror
0000000139b6e840 fill=z8:19741=60% {

-- 
Francois Tigeot
From: Matthew Dillon
Date: Tuesday, April 13, 2010 - 11:02 pm

If the only flag after B is 'M' then it is ok, it's the mirror-tid
    bug which was fixed in 2.6 but only effects incremental mirroring.
    This is self-correcting so it isn't a concern.

    If other flags are present then we have a real problem.

    If the broken inodes are localized it could be stuff left dangling
    from a crash recovery.  That's my hope.  Did you have any crashes
    while work was occuring on any of the related directories?  If this
    is the case it should be possible to remove them with 'rm'.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>
From: Francois Tigeot
Date: Tuesday, April 13, 2010 - 11:25 pm

Ok. I'm creating a complete log of the hammer show output. I'll grep for the

Nope. The errors appeared out of the blue.
At one time I suddenly found garbled contents in mails I was reading with

This is not possible :
# rm iconv.dir
rm: iconv.dir: Input/output error

-- 
Francois Tigeot
From: Matthew Dillon
Date: Thursday, April 15, 2010 - 11:48 am

Francois, did the filesystem you started getting the I/O errors on
    fill up at some point.  I noticed that it looks very full.

    That might be the common thread here between you and Jan.

						-Matt
From: Francois Tigeot
Date: Thursday, April 15, 2010 - 12:05 pm

It never did.

I think I had only about 20 GB free at some point, but then the 60-day
snapshot limit kicked in and free space began to grow.

I never did get this error before, even on test filesystems I managed to fill.

-- 
Francois Tigeot
From: Matthew Dillon
Date: Thursday, April 15, 2010 - 8:52 pm

I have pushed a new hammer directive to HEAD called 'checkmap'.
    It isn't complete yet but it should be good enough to verify that
    all record data references are accounted for in the blockmap.  I still
    need to have it check B-Tree node allocations and other stuff.

    hammer -f <device> checkmap

    If it does not output any blockmap info then everything that it checks
    (so far) is ok.

    If you specify verbose (-v) it will output the blockmap info it
    collects whether good or bad.

    The directive scans the B-Tree so it will take a little while, but
    nowhere near as long as 'show' because it doesn't dive the record data.

    Also in the same commit I pushed some changes to the 'show' directive
    to hopefully detect and report zone mismatches instead of assert.

    --

    Hopefully this will narrow down the problem space.  I'd like you to
    run both the hammer show and hammer checkmap operations with this
    latest hammer utility and report back.  bzip2'ing the output and
    making it available would also be helpful.

						-Matt

From: Matthew Dillon
Date: Friday, April 16, 2010 - 2:19 pm

At around the time when you noticed the problem, and/or a few days before
    or after, did you run a manual hammer cleanup and is it possible (if you
    did) that it might have been running at the same time as another hammer
    cleanup, such as the one run by cron?

						-Matt
From: Matthew Dillon
Date: Friday, April 16, 2010 - 3:02 pm

I think I found the smoking gun but I can't be sure until I see
    the show output from Francois.

    I noticed that all the bad CRC'd records were typically around element
    30-32 in the B-Tree node (out of ~64 elements).  That is, the middle
    of the node.

    This implies a race between the reblocker/rebalancer and a node split
    during an insertion, or a race between the reblocker and the rebalancer.

    I am testing a fix now and I am not 100% sure that this was the issue,
    but there are a lot of things pointing to it:

    * In both Jan's and Francois's cases the inodes that got corrupted
      were in areas of the filesystem under a heavy write/create/delete
      load.

    * The corrupted records appear to nearly always be in the middle of
      the B-Tree node, which implies a race against an insertion or a
      rebalancing operation while the reblocker is running.

    * And I found a bug in the reblocker that was exposed by recent work
      (the work itself was not buggy, it just exposed the bug that already
      existed) whereby the reblocker may reblock an element after relocking
      the node but without properly checking that the element is still valid.

    Jan, I think you can test this with your psql test, after you reformat
    that volume for real and start fresh.  You should be able to test this
    by running a continuous hammer reblock operation on the data while you
    are running the database test and see if corruption ultimately occurs.

    I have included my proposed patch/fix below but please do not apply
    it yet.  I want to try to reproduce the corruption here to actually
    test whether this fixes the issue or not.

    Once we fix the issue I'll have to work up a procedure to fix any
    broken filesystems.  Locating breakage is really easy, the hammer
    show and hammer checkmap commands can be used.  Fixing it, short
    of copying off the filesystem, may be more difficult.

    Jan, I am convinced that it is ...
From: Francois Tigeot
Date: Saturday, April 17, 2010 - 8:08 am

It never occured.
The emachine was left to run at night; most of the time in the last 15 days
the cleanup ware run unattended at 3am.

I did a manual cleanup once but it was after a complete 24h shutdown and I'm
sure no periodic cron ran at the same time. I would have noticed the
performance degradation.

-- 
Francois Tigeot
From: Matthew Dillon
Date: Tuesday, April 13, 2010 - 11:44 am

This is the first time I've seen this sort of problem reported.

    There are two possibilities.  First is it is a real I/O error, but
    in that case I would have expected messages in the dmesg output.
    The second is the HAMMER volume somehow got corrupted... HAMMER
    will generate EIO if records in the B-Tree specify offsets which
    are out of bounds.  If it were real corruption the CRCs would
    fail and you would get dmesg output.  If something else happened
    that caused it to lay down good records that point to bad data
    then it could report EIO without generating console output.

    One thing you can do is see if the disk is completely readable
    by using dd to read the raw drive. It should be able to read the
    whole disk until it hits the end of the disk:

	dd if=/dev/blahblah of=/dev/null bs=32k

    But beyond that, think back about things you might have done
    in the past that is outside the norm for HAMMER operations
    that might have caused the problem, such as add or remove
    a volume or adjust the disklabel or something like that.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>
From: Francois Tigeot
Date: Tuesday, April 13, 2010 - 10:33 pm

It took almost 2 hours to run but found no errors.

Nothing so extreme.

I have updated the system to the latest 2.6 a few days ago. I think the
previous version didn't have the new REDO code.

I also briefly ran a DragonFly-2.6/x86-64 test installation from another disk
and used the existing 500GB drive /home PFS. The error messages appeared the
following day.

-- 
Francois Tigeot
From: Matthew Dillon
Date: Tuesday, April 13, 2010 - 10:56 pm

Hmm.  Maybe there is an issue with 64-bit vs 32-bit use of the same
    HAMMER filesystem.  That hasn't been tested well at all, though
    theoretically it should work just fine.

    I think what I need is a full 'hammer -f <device> show' dump of the
    filesystem.  It will be large, but it should be able to bzip2
    reasonably well.  If you can upload that to leaf I can take a look
    at it.  I also need a list of (a few of the) broken filenames that
    I can track down, including their inode numbers (ls -lia).

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>
From: Francois Tigeot
Date: Tuesday, April 13, 2010 - 11:31 pm

I'm not able to create a complete dump:

# hammer -f /dev/da0s1d show > hammer-show.dump
Assertion failed: (layer2->zone == zone), function blockmap_lookup, file /usr/src/sbin/hammer/blockmap.c, line 104.

I can't get their inode numbers either:

# ls -lia iconv.dir
ls: iconv.dir: Input/output error

-- 
Francois Tigeot
Previous thread: HEADS UP - wireless driver work going into master, some drivers will stop working temporarily. by Matthew Dillon on Monday, April 12, 2010 - 10:03 am. (1 message)

Next thread: HEADS UP: I/O scheduler (dsched) now in master by Alex Hornung on Thursday, April 15, 2010 - 1:44 pm. (4 messages)