Re: ZFS melting under postgres...

To: FreeBSD Current <freebsd-current@...>
Date: Wednesday, December 12, 2007 - 7:17 pm

[Empty message]
To: Peter Losher <Peter_Losher@...>
Cc: FreeBSD Current <freebsd-current@...>
Date: Tuesday, January 22, 2008 - 5:45 am

It is hard for me to believe that this is a FreeBSD-specific bug, because
checksumming happens below the FreeBSD-specific code. Of course anything is
possible, but I think it's unlikely.

I'd start by configuring UFS on top of GELI with authentication. GELI
will also detect silent data corruption:

# geli init -a hmac/md5 -e null -s 4096 -P -K /dev/null /dev/ad4
# geli attach -p -k /dev/null /dev/ad4
# dd if=/dev/zero of=/dev/ad4.eli bs=1m (this will take a while)
# newfs -U /dev/ad4.eli
# mount -o noatime /dev/ad4.eli /mnt/tmp
Try your DB test on this file system.
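
To illustrate why an authenticated layer such as GELI catches silent
corruption, here is a minimal Python sketch of per-sector HMAC
verification. It is not GELI's on-disk format: the seal() and verify()
helpers, the random key, and mixing the sector number into the MAC are
assumptions made for the example. Only the HMAC/MD5 algorithm and the
4096-byte sector size are taken from the commands above.

```python
import hashlib
import hmac
import os

SECTOR_SIZE = 4096  # matches the -s 4096 in the geli init example


def seal(key: bytes, sector_no: int, data: bytes) -> bytes:
    """Return an HMAC/MD5 tag binding the data to its sector number."""
    msg = sector_no.to_bytes(8, "little") + data
    return hmac.new(key, msg, hashlib.md5).digest()


def verify(key: bytes, sector_no: int, data: bytes, tag: bytes) -> bool:
    """True if the sector still matches the tag stored when it was written."""
    return hmac.compare_digest(seal(key, sector_no, data), tag)


key = os.urandom(32)                 # hypothetical key, for the demo only
sector = bytes(SECTOR_SIZE)          # a zeroed sector, as after the dd step
tag = seal(key, 7, sector)

corrupted = bytearray(sector)
corrupted[100] ^= 0x01               # flip a single bit "silently"

print(verify(key, 7, sector, tag))            # True: intact sector
print(verify(key, 7, bytes(corrupted), tag))  # False: corruption detected
```

A single flipped bit makes the stored tag stop matching, so a read of
that sector fails instead of quietly returning bad data.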

--
Pawel Jakub Dawidek http://www.wheel.pl
pjd@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

To: Peter Losher <Peter_Losher@...>
Cc: FreeBSD Current <freebsd-current@...>
Date: Wednesday, December 12, 2007 - 10:55 pm

Try turning off the ZIL. While I don't use a DB, I have ZFS under high load,
and I've found that unless the ZIL is turned off I see checksum corruption
as well:

/boot/loader.conf

vfs.zfs.zil_disable=1

Cheers,
Benjamin
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"

To: <freebsd-current@...>
Cc: Benjamin Close <Benjamin.Close@...>
Date: Wednesday, December 12, 2007 - 10:58 pm

Wouldn't it be a bad idea to disable the ZIL?

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Dis...

Regards,

To: Hugo Silva <hugo@...>
Cc: <freebsd-current@...>
Date: Thursday, December 13, 2007 - 12:25 am

A good read is:

http://blogs.sun.com/perrin/entry/the_lumberjack

It shows why the ZIL exists.

Cheers,
Benjamin

To: Benjamin Close <Benjamin.Close@...>
Cc: <freebsd-current@...>, Hugo Silva <hugo@...>
Date: Thursday, December 13, 2007 - 9:59 am

So does anybody know of a battery backed NVRAM card that can be used
with FreeBSD that the ZIL could be offloaded to?

--
DaveD

To: David Duchscher <daved@...>
Cc: <freebsd-current@...>, Benjamin Close <Benjamin.Close@...>, Hugo Silva <hugo@...>
Date: Friday, December 14, 2007 - 8:57 am

Any CF card or similar will do. You don't need battery backup for
flash memory.

DES
--
Dag-Erling Smørgrav - des@des.no

To: Dag-Erling Smørgrav <des@...>
Cc: <freebsd-current@...>, Benjamin Close <Benjamin.Close@...>, Hugo Silva <hugo@...>
Date: Friday, December 14, 2007 - 11:21 am

I did think of that, but is a CF card faster than a good SAS or SATA
drive? The fastest ones I found have a top rating of 45MB/s. The one
battery-backed NVRAM card that showed up in a Google search had a
peak rate of 533MB/s. The question seems moot, though, since FreeBSD
doesn't currently support them.

Thanks for your time,
--
DaveD

To: <freebsd-current@...>
Date: Friday, December 14, 2007 - 3:29 pm

Not in transfer rate, but it could help hugely with seek-intensive IO
loads (since seeks are effectively instantaneous on flash or other
solid-state drives). In theory, they could be of immense benefit for
databases and seek-intensive operations on file systems, but the limited
bulk transfer rates and relatively small sizes (for decent money) currently
prevent their widespread use.

It would be logical to use a limited size SSD for something like a file
system journal for a large file system, except that these kind of

If an NVRAM card, SSD, or other technology presents itself as a (S)ATA
drive, there's no reason it shouldn't.

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Friday, December 14, 2007 - 11:02 pm

That's no longer true. You can't get more than 5-10MB/s from a
seek-intensive load on a RAID0 of two 15K drives, while 20-30MB/s is not a
problem for a comparably priced and sized SSD.
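
A back-of-the-envelope sketch of why seek-bound loads stay in the
single-digit MB/s range on rotating disks. The 3.5 ms average seek time
and the 16 kB request size are assumed values, not figures from this
thread; only the 15K RPM and the two-drive RAID0 come from the message
above. Transfer time is ignored and the stripe is assumed to spread
requests evenly across both drives.

```python
def random_io_throughput_mb_s(rpm, avg_seek_ms, io_kb, drives=1):
    """Estimate throughput of a purely seek-bound random I/O load."""
    # On average the platter must turn half a revolution before the
    # requested sector passes under the head.
    rotational_latency_ms = 0.5 * 60_000 / rpm
    service_ms = avg_seek_ms + rotational_latency_ms
    iops = drives * 1000 / service_ms
    return iops * io_kb / 1024


# Two 15K RPM drives in RAID0, assumed 3.5 ms seek, assumed 16 kB requests.
estimate = random_io_throughput_mb_s(rpm=15_000, avg_seek_ms=3.5,
                                     io_kb=16, drives=2)
print(f"{estimate:.1f} MB/s")  # 5.7 MB/s
```

Under these assumptions the estimate lands inside the 5-10MB/s range
quoted above; a device with no mechanical seek is limited by its
transfer rate instead.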

-Maxim

To: <freebsd-current@...>
Date: Saturday, December 15, 2007 - 6:20 am

[Empty message]
To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Saturday, December 15, 2007 - 9:30 pm

You can now get a 64GB SSD for under US$1K, which is comparable price-wise
to a RAID0 of two 70GB 15K SAS drives. For example:

http://accessories.us.dell.com/sna/productdetail.aspx?sku=341-5582&cs=04...

-Maxim

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Saturday, December 15, 2007 - 11:57 am

Kingston CF Elite, 20 / 25 MBps write / read
Kingston CF Ultimate, 40 / 45 MBps write / read

SanDisk Extreme III CF, 20 MBps
SanDisk Extreme IV CF, 45 MBps

Sony CF 300X, 45 MBps

These are just a few of those available from my regular supplier.

DES
--
Dag-Erling Smørgrav - des@des.no

To: <freebsd-current@...>
Date: Saturday, December 15, 2007 - 12:12 pm

These are all "normal" CompactFlash cards, for which the widely
available size seems to be 16 GB max, right? I was thinking about
something more like this:
http://gizmodo.com/gadgets/peripherals/adatas-128gb-solid-state-drive-sees-the-light-of-day-231693.php
or this: http://www.mtron.net/English/Product/pc_msd1000.asp

Did you (or anyone) deploy CF drives for production servers?

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Sunday, December 16, 2007 - 10:51 am

So? That's more than enough for a ZFS intent log (as a rule of thumb,

My router (and DNS, NTP and DHCP server) is a net4801 with a 1 GB CF
chip.

DES
--
Dag-Erling Smørgrav - des@des.no

To: Dag-Erling Smørgrav <des@...>
Cc: <freebsd-current@...>
Date: Sunday, December 16, 2007 - 10:57 am

[Empty message]
To: <freebsd-current@...>
Date: Saturday, December 15, 2007 - 2:15 pm

If you're using compact flash for something that's constantly updated
like a ZIL, wouldn't your CF card die real quick?

I've deployed CF in production, but as a read-only medium with
occasional writes only for configuration updates.

From what I understand the specialized expensive solid-state drives
that you guys are discussing are better designed for this type of write
duty whereas CF would probably not last very long.

Since a ZIL is not really seek-intensive, why not just offload it to its
own standard hard disk that has its write caching and all other similar
data-corrupting technologies disabled?

To: <freebsd-current@...>
Date: Saturday, December 15, 2007 - 6:04 pm

Yes. I don't see the point of writing a log that's mostly sequentially
accessed to an SSD, especially one that probably wears the same areas of
the drive.
I'm more interested in loads like databases.

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Saturday, December 15, 2007 - 10:42 pm

CF and the flash-based SSD drives rotate the flash cells anyway, so it
doesn't matter that much whether you write the same block or not.
I wouldn't worry about wearing out those devices, since today's media

I wouldn't do both with them unless required for a specific reason.
The problem is how they work.
They contain NAND flash chips which have two data areas containing
data blocks of typically slightly more than 4 or 8 kB these days.
One area is 100% error-free with a high write rate, but small, and the
other is of much lower quality, but large.
Devices use the latter for the offered data blocks and the good cells
for maintaining the allocation of them.
One problem is that with the data blocks being that big, when writing
512 bytes you effectively do a read-modify-write of a larger physical
block.
This can be handled quite well with a larger FS block size.
The much bigger problem is power loss while writing such a
maintenance block.
You lose a very large area of logical blocks when this fails,
since a 4k maintenance block contains the allocation for several hundred
kB of logical data blocks.
In other words, you can lose data blocks that have not been written
for a long time, and the database wouldn't expect a problem with that data.
Even for the ZIL it is very questionable if you lose a large data area,
since the purpose is to have the data that was already synced readable
after a power loss.
I'm not sure what happens in case of a device reset at the wrong moment;
possibly this depends on the specific media, but I wouldn't be surprised
to see read errors after a reset without power loss as well.
This is true of all NAND-based flash media: SD, MMC, SM, CF, ...
There are media which are less critical because of the way they utilize
the maintenance blocks, but those details are usually a vendor secret.
I do run PostgreSQL on SD media with ARM-based FreeBSD systems, but
I'm prepared to lose the whole database and to recover it from backup
if things go wrong.
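
The read-modify-write cost described above can be estimated with a
small Python sketch. It assumes round 4096 and 8192 byte flash blocks
(the message says slightly more than 4 or 8 kB) and a hypothetical
helper, physical_bytes_written(); real devices hide these details, so
treat the numbers as illustrative only.

```python
def physical_bytes_written(logical_write: int, flash_block: int,
                           offset: int = 0) -> int:
    """Bytes the device must rewrite for one logical write.

    Every flash block touched by the write has to be read, modified
    and written back as a whole.
    """
    first = offset // flash_block
    last = (offset + logical_write - 1) // flash_block
    return (last - first + 1) * flash_block


for fs_block in (512, 4096, 8192):
    for flash_block in (4096, 8192):
        amp = physical_bytes_written(fs_block, flash_block) / fs_block
        print(f"{fs_block:>5} B write on {flash_block} B flash block: "
              f"{amp:.0f}x")
```

With these assumed sizes a 512-byte write costs 8x or 16x its size in
physical writes, while a file system block that matches or exceeds the
flash block, and is aligned to it, avoids the overhead.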

--
B.Walter

To: <ticso@...>
Cc: <freebsd-current@...>, Ivan Voras <ivoras@...>
Date: Sunday, December 16, 2007 - 5:40 am

Bernd Walter wrote:
...

ZFS doesn't suffer from this problem because its design
is to always write a new section of data rather than
overwrite "current" data.

So if you lose power in the middle of a write to a data
block, there is no damage to the old data.
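
A toy Python model of the difference between overwriting in place and
writing a new copy. It only illustrates the file-system-level argument
made above, not what a flash device does internally; the dictionary
used as "storage", the PowerLoss exception and the helper names are all
made up for the example.

```python
class PowerLoss(Exception):
    """Raised to simulate power failing in the middle of a block write."""


def write_block(storage, addr, data, fail_midway=False):
    """Store data at addr; optionally leave a torn block and 'lose power'."""
    if fail_midway:
        half = len(data) // 2
        storage[addr] = data[:half] + b"?" * (len(data) - half)  # torn write
        raise PowerLoss
    storage[addr] = data


def crash_during_overwrite():
    """Update in place: the only copy of the block is the one being written."""
    storage = {0: b"old data"}
    try:
        write_block(storage, 0, b"new data", fail_midway=True)
    except PowerLoss:
        pass
    return storage[0]


def crash_during_cow():
    """Copy-on-write: write elsewhere, repoint the root only after success."""
    storage = {0: b"old data"}
    root = 0
    try:
        write_block(storage, 1, b"new data", fail_midway=True)
        root = 1  # never reached when the write fails
    except PowerLoss:
        pass
    return storage[root]


print(crash_during_overwrite())  # b'new ????' (torn block)
print(crash_during_cow())        # b'old data'
```

In the copy-on-write case the torn block exists but nothing points to
it, so the previous version is still what a reader sees after the crash.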

Darren

To: Dag-Erling <des@...>
Cc: David Duchscher <daved@...>, <freebsd-current@...>, Benjamin Close <Benjamin.Close@...>, Hugo Silva <hugo@...>
Date: Friday, December 14, 2007 - 9:08 am

This may also have to wait for a future version of ZFS. I remember
reading about this kind of thing as an upcoming feature in Solaris. I
believe the way this feature would work is that ZFS would allow creating
the ZIL on a different pool from the filesystem, i.e. create a zpool on
the CF card and get the ZIL to live there.

To: Doug Rabson <dfr@...>
Cc: David Duchscher <daved@...>, <freebsd-current@...>, Benjamin Close <Benjamin.Close@...>, Hugo Silva <hugo@...>
Date: Friday, December 14, 2007 - 9:25 am

AFAIK this is already implemented in ZFS, though I'm not sure Pawel has
merged it into FreeBSD yet.

Note that you can also get disk drives with a certain amount of NAND
flash built-in, but FreeBSD doesn't support that yet.

DES
--
Dag-Erling Smørgrav - des@des.no

To: <freebsd-current@...>
Date: Friday, December 14, 2007 - 9:47 am

http://www.opensolaris.org/os/community/zfs/version/7/

You simply add a 'log' vdev to a pool. It's included in snv_75, maybe
earlier.

/Kenneth

To: <freebsd-current@...>
Date: Friday, December 14, 2007 - 9:46 am

They are intended for use with Vista, and have recently been found to
be marginally more effective than a placebo (for Vista performance).
The speed gains are barely distinguishable from measurement error.
So, if the drives helped ZFS, it would be a big irony.

There are companies that manufacture "pure" SSDs (www.superssd.com,
www.soliddata.com) with battery-backup.
Unfortunately, the price-tag of these systems is still beyond reach for
normal customers.

cheers,
Rainer

To: Hugo Silva <hugo@...>
Cc: <freebsd-current@...>, Benjamin Close <Benjamin.Close@...>
Date: Wednesday, December 12, 2007 - 11:00 pm

Yes. However, FreeBSD suffers from deadlocks under load if ZIL is enabled.

-Kip

To: <freebsd-current@...>
Cc: Kip Macy <kip.macy@...>, Benjamin Close <Benjamin.Close@...>, Hugo Silva <hugo@...>
Date: Friday, December 14, 2007 - 2:20 pm

> Yes. However, FreeBSD suffers from deadlocks under load if ZIL is enabled.

Is there some ML post / documentation on this? I am trying to keep up-to-date
on ZFS status (on FreeBSD and otherwise), but I don't think this has been
discussed on any of the usual mailing lists.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org

To: <freebsd-current@...>
Date: Friday, December 14, 2007 - 3:32 pm

Should I open a page on wiki.freebsd.org to keep track of the ZFS-related bugs? :)

To: <freebsd-current@...>
Date: Thursday, December 13, 2007 - 1:26 pm

Do you know how such deadlocks manifest? Do they perhaps result in a
process locked in "zfs" wchan (not "zfs:&...")?

To: Kip Macy <kip.macy@...>
Cc: <freebsd-current@...>, Hugo Silva <hugo@...>
Date: Thursday, December 13, 2007 - 12:21 am

It also comes down to what you're doing. ZFS is always consistent on disk.
The ZIL provides the journal between the last pool transaction write and
what has changed since that write. Either way ZFS will come up cleanly
after a power failure; it's just a question of whether you have those last
few syncs or not.
For the application I'm using ZFS for (rsynced backups, snapshotted
daily) that'll be corrected the next day anyway. For a DB, this could be
a show stopper.
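
The trade-off can be sketched with a toy Python model. ToyPool and its
methods are invented for illustration and leave out almost everything
real ZFS does; the only point is what survives a crash with and without
an intent log.

```python
class ToyPool:
    """Toy model: committed state, in-RAM changes, and an intent log."""

    def __init__(self, zil_enabled: bool):
        self.zil_enabled = zil_enabled
        self.on_disk = {}      # state as of the last committed transaction
        self.pending = {}      # changes held in RAM until the next commit
        self.intent_log = []   # stable-storage record of synced writes

    def write(self, key, value, sync=False):
        self.pending[key] = value
        if sync and self.zil_enabled:
            # A sync write returns only after it is in the intent log.
            self.intent_log.append((key, value))

    def commit_txg(self):
        self.on_disk.update(self.pending)
        self.pending.clear()
        self.intent_log.clear()  # logged entries are now redundant

    def crash_and_recover(self):
        self.pending.clear()                # RAM contents are gone
        for key, value in self.intent_log:  # replay the intent log
            self.on_disk[key] = value
        self.intent_log.clear()
        return dict(self.on_disk)


def demo(zil_enabled):
    pool = ToyPool(zil_enabled)
    pool.write("a", 1)
    pool.commit_txg()
    pool.write("b", 2, sync=True)  # e.g. a database commit plus fsync()
    pool.write("c", 3)             # ordinary asynchronous write
    return pool.crash_and_recover()


print(demo(True))   # {'a': 1, 'b': 2}
print(demo(False))  # {'a': 1}
```

Both results are consistent states; the difference is that without the
log the synced write "b", which the application was told was safe, is
gone. That is harmless for a nightly rsync and a real problem for a
database.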

Cheers,
Benjamin
