Re: Should we simply disallow ZFS on FreeBSD/i386?

To: <freebsd-current@...>
Date: Friday, January 4, 2008 - 7:42 am

Hi,

From what I know about the details of the implementation and what it
would take to fix the problems, is it safe to assume ZFS will never become
stable during the 7.x lifetime?

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Monday, January 7, 2008 - 10:04 am

Have you heard of the logical fallacy called "plurium interrogationum"?
You may not be familiar with the phrase (which is Latin for "multiple
questions"), but it's what you're doing here: asking a question which is
impossible to answer truthfully because it is based on an incorrect
premise, and to answer the question correctly you must first discuss the
premise. It's a favorite Hollywood plot device, because you can have
the smart-aleck lawyer interrupt the confused witness and insist on a
yes or no answer, forcing the witness to implicitly agree with the
premise. I doubt it would work in a real-life court, though, because
judges tend to be smart people. But I digress.

Your question is based on the premise that ZFS in FreeBSD 7 is unstable.
That premise is false. There are issues with auto-tuning of certain
parameters, which can cause kmem exhaustion, but they are easily worked
around by setting a few tunables. It has worked very well for me
(raidz, 1.2 TB pool, 4 GB RAM, ~60 file systems and twice as many
snapshots) after I added the following lines to loader.conf:

vm.kmem_size="1G"
vfs.zfs.arc_min="64M"
vfs.zfs.arc_max="512M"
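
Settings like these can be sanity-checked before rebooting. The following is a minimal Python sketch, assuming loader.conf size values use the binary K/M/G suffixes shown above. The helper names (parse_size, parse_loader_conf, check_tunables) are made up for this illustration and are not part of any FreeBSD tool. It only verifies the ordering arc_min < arc_max < kmem_size, which the values above satisfy.

```python
import re

# Binary multipliers for the size suffixes used in the loader.conf lines above.
_SUFFIX = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}

def parse_size(text):
    """Convert a string such as "512M" or "671088640" to a byte count."""
    m = re.fullmatch(r"(\d+)([KMG]?)", text.strip().upper())
    if m is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(m.group(1)) * _SUFFIX[m.group(2)]

def parse_loader_conf(lines):
    """Return {name: value} for simple name="value" lines, skipping comments."""
    settings = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            name, value = line.split("=", 1)
            settings[name.strip()] = value.strip().strip('"')
    return settings

def check_tunables(settings):
    """Hypothetical sanity check: arc_min < arc_max < kmem_size."""
    kmem = parse_size(settings["vm.kmem_size"])
    arc_min = parse_size(settings["vfs.zfs.arc_min"])
    arc_max = parse_size(settings["vfs.zfs.arc_max"])
    return arc_min < arc_max < kmem

conf = parse_loader_conf([
    'vm.kmem_size="1G"',
    'vfs.zfs.arc_min="64M"',
    'vfs.zfs.arc_max="512M"',
])
print(check_tunables(conf))
```

A check of this kind says nothing about whether the values suit a given workload; it only catches an arc_max that is not smaller than the kmem map.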

DES
--
Dag-Erling Smørgrav - des@des.no
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"

To: Dag-Erling Smørgrav <des@...>
Cc: <freebsd-current@...>
Date: Monday, January 7, 2008 - 10:37 am

On 07/01/2008, Dag-Erling Smørgrav <des@des.no> wrote:

> Your question is based on the premise that ZFS in FreeBSD 7 is unstable.
> That premise is false.

At most, we'll have to agree to disagree. A "tuning" of the system (at
least from my experience) is about system performance, not whether the
system will crash or not. You may define the word to mean something
else but that's your thing.

The reason I'm aggressively discussing this is that labeling the
problem as "tuning" will, for any non-trivial task which has some
growth in system load, result in a server that needs constant tuning
just to survive another day. What is tuned today may as well result in
a crash tomorrow if the load rises. Web servers are notorious for this
(though other types have of course similar behaviour) - a
"slashdotting" of a "properly tuned" FreeBSD system with ZFS will not
result in a slowdown - it will result in the system crashing. This is
not acceptable, and therefore dismissing it as "just tuning" is
counterproductive and bad engineering.

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Monday, January 7, 2008 - 4:46 pm

ZFS is clearly marked as experimental, so it's reasonable to require tuning
to avoid crashes. If it's still the case when the experimental status is
lifted, then you can have this argument all over again.

cheers,
Andrew

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Monday, January 7, 2008 - 5:59 am

To sum up this thread, let me present ZFS status as of today.

Before I do that, one explanation. I was away from FreeBSD for 3-4
weeks because of real-life issues, etc. I hope I'm now back for good.
Let me also use this opportunity to again invite any interested
committers to help work on ZFS (I've been inviting people to help from
day one).

Ok.

The most pressing issues currently are:
1. kmem_map exhaustion.
2. Low memory deadlocks in ZFS itself.

I believe the 2nd problem is already fixed in OpenSolaris; at least that
was my impression when I made the last integration, but I need to double
check. If that's true, I'll try to commit the fix before 7.0-RELEASE.

The 1st problem of course affects a much wider audience. First of all you
need:

http://people.freebsd.org/~pjd/patches/vm_kern.c.2.patch

The patch is not yet committed, because I was discussing better
solutions with alc@. I don't think we (he) will be able to come up with
something better before 7.0-RELEASE, so I'm going to ask re@ for
approval for this patch today. Note that it is a low-risk change, because
it is executed only in a situation where the system would panic anyway.

Of course it is much better to use ZFS on 64-bit systems, but it also
works on i386. I've been running ZFS in production for many months on two
i386 systems. One has 1GB of memory and this tuning in loader.conf:

vfs.zfs.prefetch_disable=1
vm.kmem_size=671088640
vm.kmem_size_max=671088640

It has three ZFS pools and no UFS at all. The load is rather light,
serving large files. No panics.

The second "production" box is my laptop. It has 2GB of RAM (it worked
fine with 1GB too), but I do have 'options KVA_PAGES=512' in my kernel
config and my loader.conf looks like this:

vm.kmem_size=1073741824
vm.kmem_size_max=1073741824
vfs.zfs.prefetch_disable=1

My laptop is ZFS-only. No panics whatsoever.
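
The raw byte counts in the two loader.conf examples are easier to compare once converted to MiB. Below is a small Python sketch of that arithmetic. The 4 MiB size of one KVA_PAGES unit is an assumption that, as far as I know, applies to non-PAE i386 kernels (PAE kernels use a different unit), so check the kernel configuration notes before relying on it.

```python
MIB = 1024 * 1024

def to_mib(nbytes):
    """Bytes to whole MiB; raises if the value is not MiB-aligned."""
    if nbytes % MIB:
        raise ValueError("not a whole number of MiB")
    return nbytes // MIB

# vm.kmem_size values quoted in the two loader.conf examples.
server_kmem = 671088640
laptop_kmem = 1073741824

# Assumption: on a non-PAE i386 kernel each KVA_PAGES unit maps 4 MiB of
# kernel virtual address space.
KVA_UNIT_MIB = 4
laptop_kva_mib = 512 * KVA_UNIT_MIB

print(to_mib(server_kmem))   # 640
print(to_mib(laptop_kmem))   # 1024
print(laptop_kva_mib)        # 2048
```

Under that assumption, KVA_PAGES=512 gives about 2 GiB of kernel address space, which is what leaves room for a 1 GiB kmem map on the laptop.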

The box I've been running ZFS on the longest is an amd64 system with 1GB of
RAM. This box is used for backups (ZFS snapshots are so damn handy) and
guess...

To: Pawel Jakub Dawidek <pjd@...>
Cc: <freebsd-current@...>
Date: Monday, January 7, 2008 - 6:30 am

I'd suggest we do give all three warnings (KVA_PAGES, kmem_size, i386)
at once, preferably both when the ZFS module loads and when a zpool is
created. I think it's important that the three pieces of information be
given at the same time so the user doesn't need to hunt for solutions
after panics.

Your comment that people are panicking more than ZFS is correct, but
that illustrates the importance people give to having a file system not
crash on them :)

To: <freebsd-current@...>, <ivoras@...>, <pjd@...>
Date: Tuesday, January 8, 2008 - 1:58 pm

Ivan Voras wrote:
> Pawel Jakub Dawidek wrote:
>
> > Let's try to think about how we can warn people clearly about proper
> > tuning and what proper tuning actually means. I think we should advise
> > increasing KVA_PAGES on i386 and not only vm.kmem_size. We could also
> > warn that running ZFS on 32-bit systems is not generally recommended.
> > Any other suggestions?
>
> I'd suggest we do give all three warnings (KVA_PAGES, kmem_size, i386)
> at once, preferably both when the ZFS module loads and when a zpool is
> created. I think it's important that the three pieces of information be
> given at the same time so the user doesn't need to hunt for solutions
> after panics.

How about including the URL of the ZFS tuning guide in the
warning message:

http://wiki.freebsd.org/ZFSTuningGuide

It contains all the necessary information for both i386 and
amd64 machines. It can also easily be updated if necessary
so people always get the most up-to-date information.

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Commercial register: Registration court Munich, HRA 74606, management:
secnetix Verwaltungsgesellsch. mbH, commercial register: Registration court
Munich, HRB 125758, managing directors: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD services, products and more: http://www.secnetix.de/bsd

"Documentation is like sex; when it's good, it's very, very good,
and when it's bad, it's better than nothing."
-- Dick Brandon

To: <freebsd-current@...>, <ivoras@...>, <pjd@...>
Date: Saturday, January 12, 2008 - 5:45 pm

Pawel said in Nov:

The Wiki should be changed. Allow ZFS to autotune it, don't tune it by
hand.
-----

Yet the wiki still recommends hand tuning?
Cheers.

--
Mark Powell - UNIX System Administrator - The University of Salford
Information Services Division, Clifford Whitworth Building,
Salford University, Manchester, M5 4WT, UK.
Tel: +44 161 295 6843 Fax: +44 161 295 5888 www.pgp.com for PGP key

To: <freebsd-current@...>
Cc: <pjd@...>, <ivoras@...>
Date: Tuesday, January 8, 2008 - 2:24 pm

Actually, it fails to mention the most important bit: vfs.zfs.arc_max,
which allows you to restrict the amount of memory used by ZFS to
something comfortably smaller than vm.kmem_size.

DES
--
Dag-Erling Smørgrav - des@des.no

To: Dag-Erling Smørgrav <des@...>
Cc: <freebsd-current@...>, <pjd@...>, <ivoras@...>
Date: Wednesday, January 9, 2008 - 1:39 pm

It was in the ZFS tuning guide, but was removed in revision 20.
There is no note saying why the change was made.

Scot

To: Dag-Erling Smørgrav <des@...>
Cc: <freebsd-current@...>, <pjd@...>
Date: Tuesday, January 8, 2008 - 7:16 pm

On 08/01/2008, Dag-Erling Smørgrav <des@des.no> wrote:

> Actually, it fails to mention the most important bit: vfs.zfs.arc_max,
> which allows you to restrict the amount of memory used by ZFS to
> something comfortably smaller than vm.kmem_size.

Pawel, is it recommended?

If it is, I'll add it to the page.

To: Ivan Voras <ivoras@...>
Cc: Dag-Erling Smorgrav <des@...>, <freebsd-current@...>, <pjd@...>
Date: Wednesday, January 9, 2008 - 1:45 am

With the vm_kern.c.2.patch, it doesn't seem to be an issue, at least
for me. "c" always stays far away from "c_max":

kstat.zfs.misc.arcstats.p: 218885440
kstat.zfs.misc.arcstats.c: 342346436
kstat.zfs.misc.arcstats.c_min: 20971520
kstat.zfs.misc.arcstats.c_max: 503316480
kstat.zfs.misc.arcstats.size: 342342144
vm.kmem_size: 671088640
hw.physmem: 1064771584
vm.kmem_map_panics_avoided: 171

The last sysctl was added by me to track how often the patch saved my
system from a panic :) I suppose lowering arc_max would reduce the
number of times the routine was called, though.
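
For anyone wanting to watch the same relationship on their own system, here is a hedged Python sketch that parses "name: value" lines such as the sysctl output above and reports how close the ARC target is to its cap. The function names are invented for this example, and the reading of "c" as the ARC target size and "c_max" as its upper bound is my understanding of the arcstats fields rather than something stated in this thread.

```python
SYSCTL_OUTPUT = """\
kstat.zfs.misc.arcstats.p: 218885440
kstat.zfs.misc.arcstats.c: 342346436
kstat.zfs.misc.arcstats.c_min: 20971520
kstat.zfs.misc.arcstats.c_max: 503316480
kstat.zfs.misc.arcstats.size: 342342144
vm.kmem_size: 671088640
"""

def parse_sysctl(text):
    """Parse "name: value" lines into a dict of integers."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            values[name.strip()] = int(value)
    return values

def arc_headroom(values):
    """Return (c / c_max, c_max / kmem_size) as fractions."""
    prefix = "kstat.zfs.misc.arcstats."
    c = values[prefix + "c"]
    c_max = values[prefix + "c_max"]
    kmem = values["vm.kmem_size"]
    return c / c_max, c_max / kmem

stats = parse_sysctl(SYSCTL_OUTPUT)
target_ratio, cap_ratio = arc_headroom(stats)
print(f"ARC target is {target_ratio:.0%} of c_max")
print(f"c_max is {cap_ratio:.0%} of vm.kmem_size")
```

With the numbers quoted above, the target sits at roughly two thirds of c_max, and c_max itself is three quarters of vm.kmem_size.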

--
Dan Nelson
dnelson@allantgroup.com

To: <freebsd-current@...>, <ivoras@...>, <pjd@...>
Date: Tuesday, January 8, 2008 - 1:59 pm

The tuning information belongs in the zfs(8) manual page.

--
Steve

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>, Pawel Jakub Dawidek <pjd@...>
Date: Monday, January 7, 2008 - 9:17 am

Having read the thread and people's reasons for using ZFS, it does
seem that they are trying to use ZFS to solve non-problems,
especially given that someone commented that they use a 1:10 kmem:HD
space ratio!

Igor :-)

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 11:36 am

I'm not sure if anyone has mentioned this yet in the thread, but another thing
worth taking into account in considering the stability of ZFS is whether or
not Sun considers it a production feature in Solaris. Last I heard, it was
still considered an experimental feature there as well.

Robert N M Watson
Computer Laboratory
University of Cambridge

To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 12:47 pm

Last I heard, rsync didn't crash Solaris on ZFS :)

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 1:20 pm

[Empty message]
To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 1:36 pm

I can't provide a citation for a thing that doesn't happen - you don't
hear things like "oh and yesterday I ran rsync on my Solaris with ZFS
and *it didn't crash*!" often.

But, with some grains of salt taken, consider these Google results:

* searching for "rsync crash solaris zfs": 790 results, most of them
obviously irrelevant
* searching for "rsync crash freebsd zfs": 10,800 results; a small
number of the results are from this thread, some are duplicates, but it's
a large number in any case.

I feel that the number of Solaris+ZFS installations worldwide is larger=20
than that of FreeBSD+ZFS and they've had ZFS longer.

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 4:00 pm

I used zfs on FreeBSD current amd64 around summer 2006 as a
samba-server for internal use on a dual xeon (first generation 64-bit,
somewhat slow and hot) with 4 GB ram and two qlogic hba's attached to
approx. 8 TB of storage. I did not once experience any kernel panic or
other unplanned stop. But whenever I manually mounted an smbfs share,
the terminal would not return to the command line.

I upgraded in October 2007 and the smbfs mount returned to the command
line and I thought I was happy. Until I started to get the "kmem_map
too small" kernel panics when doing much I/O (syncing 40 GB of small
files). I tuned the values as indicated in the ZFS tuning guide,
rebooted, and increased the values as the kernel panics persisted. When
I increased the values even more I ended up with a kernel which
refused to boot. Boy, I was almost getting a panic myself :-)

Applying Pawel's patch did make the server survive two or three 40 GB
rsync runs, so the patch did help. But we were approaching the Christmas
season, which is a very critical time for us, so I migrated to Solaris 10.
The Solaris server has had no downtime, but it would be premature to
conclude that Solaris is more stable in my situation.

--
regards
Claus

When lenity and cruelty play for a kingdom,
the gentlest gamester is the soonest winner.

Shakespeare

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 2:10 pm

Almost all Solaris systems are 64 bit.

Kris

To: Kris Kennaway <kris@...>
Cc: <freebsd-current@...>, Ivan Voras <ivoras@...>
Subject: ZFS honesty
Date: Sunday, January 6, 2008 - 5:00 pm

So, let's be honest here. ZFS is simply unreliable on FreeBSD/i386.
There are things that you can do to mitigate the problems, and in certain
well controlled environments you might be able to make it work well
enough for your needs. But as a general rule, don't expect it to work
reliably, period. This is backed up by Sun's own recommendation to not
run it on 32-bit Solaris.

But let's also be honest about ZFS in the 64-bit world. There is ample
evidence that ZFS basically wants to grow unbounded in proportion to the
workload that you give it. Indeed, even Sun recommends basically
throwing more RAM at most problems. Again, tuning is often needed, and
I think it's fair to say that it can't be expected to work on arbitrary
workloads out of the box.

Now, what about the other problems that have been reported in this
thread by Ivan and others? I don't think that it can be said that the
only problem that ZFS has is with memory. Unfortunately, it looks like
these "other" problems aren't well quantified, so I think that they are
being unfairly dismissed. But at the same time, maybe these other
problems are rare and unique enough that they represent very special
cases that won't be encountered by most people. But it also tells me
that ZFS is still immature, at least in FreeBSD.

The universal need for tuning combined with the poorly understood
problem reports tells me that administrators considering ZFS should
expect to spend a fair amount of time testing and tuning. Don't
expect it to work out of the box for your situation. That's not to
say that it's useless; there are certainly many people who can attest to
it working well for them. Just be prepared to spend time and possibly
money making it work, and be willing to provide good problem reports for
any non-memory related problems that you encounter.

Scott

To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 6:20 pm

JFWIW - last night's trial OpenSolaris/Indiana devel ISO, installed on a Core 2
Duo with 2GB, created something it reported as 'Z-lite' (IIRC - it wasn't worth
wasting HDD space on...)

Anyone know if this is 'different' on Solaris for i386 from -64?
i.e. - does Sun use a 'lite' and a 'full' version?

And, if so, [is there | should there be] an equivalent in the FreeBSD world? or

Clearly so.

So much so that IMNSHO, inclusion of most *remaining* ZFS issues more properly
belongs on the ZFS-specific mailing list.

I don't see much - if any - remaining evidence that there are things either
'wrong' or even sub-optimal with FreeBSD *itself* that only ZFS exposes.

Au contraire - FreeBSD seems to be as accommodating to ZFS needs as can be.

The rest seems to be up to ZFS code, 'sensing' of resources & load, manual &
auto-config, dynamic adjustment - more graceful degradation & recovery.

Whatever.

JM2CW, but the level of 'traffic' on this list in re still-experimental-at-best
ZFS is distracting attention from issues that are more universal, critical to
more users and uses - and more in need of scarce attention 'Real Soon Now'.

It almost begs dismissal of ZFS posts to the bespoke list out-of-hand.

ZFS is still eminently 'avoidable' for now.

Reports of I/O problems, drivers that can corrupt data on *UFS* are a whole
'nuther matter..

Bill


To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 6:43 pm

For my part it's because I'm "desperate" for a good file system, and ZFS
seemed to be "it" for a while. I'd also settle for any other, including
a stable version of UFS that's pleasant to work with on TB-sized drives
(Sun's UFS? BLUFFS?), XFS, Ext4, LFS, HAMMER, whatever.

I've tried contacting the author of BLUFFS, but without optimistic results.

To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 6:58 pm

None are perfect. But ZFS is just *too* new. And not just on *BSD.
If IBM had not already had GPFS, Sun might never even have 'invented' ZFS.

The 'other' ones with the longest 'history' - where known problems have known
avoidance/workarounds - may well be XFS and JFS. Heavy-lifters with commercial
track records, both.

Not to mention UFS...

I'm still in the practice of 'slicing' into 50 GB or so - 100GB max - no matter
*what* the drive size is.

So where's the 'beef'?

Half-terabyte *files*? I surely hope not..

At some point too many eggs (files) in one basket just makes b/u restore a
nightmare.

There are no silver bullets.

Drives fail. Controllers fail, and sometimes had done so long before anyone
noticed they were subtly corrupting data. So even RAID arrays and offline b/u
can fail one..

ZFS doesn't 'fix' all that - just approaches a fix in an all-software manner.

Other failings aside, there is an overhead penalty for all the 'handling'.

Coders may believe in that. It's what they do.

I'll take simplicity, redundant hardware. And compartmentalization.

Faster, cheaper, lasts a long time.

And takes more manageable sized chunks out of yerass when it DOES go tits-up.
As they all do.

Bill

To: 韓家標 <askbill@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 8:03 pm

On Sun, 2008-01-06 at 22:58 +0000, 韓家標 Bill Hacker wrote:
Could you by any chance elaborate -- from the information available to
me, I did not get an impression that ZFS is the cluster-aware filesystem
OT: As someone, who has ~10TB of compressed high-fidelity documents in
production (AIX/JFS2), I can tell you that this approach will only take
Not any better than 200 x 50GB filesystems ;)

--
Alexandre "Sunny" Kovalenko


To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 10:29 pm

From the Wikipedia article on Lustre...

"...Sun completed its acquisition of Cluster File Systems, Inc., including the
Lustre file system, on October 2, 2007, with the intention of bringing the
benefits of Lustre technologies to Sun's ZFS file system and the Solaris
operating system."

So Sun has had what? 2+ months? to try to fill a ZFS 'hole' that was worth a
major investment? See also traffic on *Sun's* ZFS list.

Far more features than that - 'robust', 'fault tolerant', 'Disaster Recovery'
... all the usual buzzwords.

And nothing prevents using 'cluster' tools on a single box. Not storage-wise anyway.

More importantly - GPFS has just under ten years in the market, and has become a
primary player in Supercomputing as well as video on demand et al.

BTW: UFS(1) / FFS - have very respectable upper-bounds - UFS2 even more so, so
(even) Sun is not totally dependent on ZFS. Unless they choose to become so...

Finally - the principal architect/miracle worker of ZFS on FreeBSD - pjd@ -
seems to be heavily committed on other matters now, and may be so for some time
to come.

Ergo 'caution' remains appropriate for production use w/r ZFS - perhaps until 8.X.

Bill


To: 韓家標 Bill Hacker <askbill@...>
Cc: <freebsd-current@...>
Date: Wednesday, January 9, 2008 - 9:23 am

韓家標 Bill Hacker writes:
> > OTOH that's all GPFS is.
>
> Far more features than that - 'robust', 'fault tolerant', 'Disaster Recovery'
> ... all the usual buzzwords.
>
> And nothing prevents using 'cluster' tools on a single box. Not storage-wise anyway.

Having had the misfortune of being involved in a cluster which used
GPFS, I can attest that GPFS is anything but "robust" and "fault
tolerant" in my experience. Granted this was a few years ago, and
things may have improved, but that one horrible experience was
sufficient to make me avoid GPFS for life.

Drew

To: Andrew Gallatin <gallatin@...>
Cc: 韓家標 <askbill@...>, <freebsd-current@...>
Date: Wednesday, January 9, 2008 - 10:55 am

Would you mind sharing your experience, perhaps in private e-mail? I
am especially interested in the platform you used (AIX or Linux) and
the underlying storage configuration (directly attached vs. separate
file system servers).

I am running a few small AIX clusters in the lab using GPFS 3.1 over iSCSI
and so far have been fairly pleased with that.

However, OP's point was that ZFS has inherent cluster abilities, of
which I have found no information whatsoever.

--
Alexandre "Sunny" Kovalenko


To: Alexandre "Sunny" Kovalenko <alex.kovalenko@...>
Cc: <askbill@...>, <freebsd-current@...>
Date: Wednesday, January 9, 2008 - 11:36 am

"Alexandre \"Sunny\" Kovalenko" writes:
>
> On Wed, 2008-01-09 at 08:23 -0500, Andrew Gallatin wrote:
> > 韓家標 Bill Hacker writes:
> > > > OTOH that's all GPFS is.
> > >
> > > Far more features than that - 'robust', 'fault tolerant', 'Disaster Recovery'
> > > ... all the usual buzzwords.
> > >
> > > And nothing prevents using 'cluster' tools on a single box. Not storage-wise anyway.
> >
> > Having had the misfortune of being involved in a cluster which used
> > GPFS, I can attest that GPFS is anything but "robust" and "fault
> > tolerant" in my experience. Granted this was a few years ago, and
> > things may have improved, but that one horrible experience was
> > sufficient to make me avoid GPFS for life.
> Would you mind sharing your experience, maybe in the private E-mail. I
> am especially interested in the platform you have used (as in AIX or
> Linux) and underlying storage configuration (as in directly attached vs.
> separate file system servers).
>
> I am running few small AIX clusters in the lab using GPFS 3.1 over iSCSI
> and so far was fairly pleased with that.

Linux, with GPFS 1.x over ethernet. If there was even the slightest
load on the ethernet network, and a GPFS heartbeat message got
lost, the entire FS would die. That did not meet my definition of
robust :(. Note that this was nearly 4 years ago, so it has likely
gotten better.

> However, OP's point was that ZFS has inherent cluster abilities, of
> which I have found no information whatsoever.

Indeed, but I do remember hearing the Lustre/ZFS rumors.

Drew

To: <freebsd-current@...>
Date: Monday, January 7, 2008 - 11:23 am

I agree.

I built a new backup server (dual-core processor, amd64 kernel, 3GB
RAM, 10x 500GB SATA disks, Areca 1230) to receive data from all my
local servers on a 4TB ZFS pool (using compression, ~300 snapshots
and ~90 filesystems) and then write it to LTO3 tape drives.

It worked fine after the required tuning (vm patch, prefetch disable, etc.).

But I lost my data twice. The first time was on 12 November 2007: the
system froze, and after the reboot I got this panic when trying to mount
the ZFS pool:

Dump header from device /dev/ad0s1b
Architecture: amd64
Architecture Version: 2
Dump Length: 103477248B (98 MB)
Blocksize: 512
Dumptime: Mon Nov 12 14:56:12 2007
Hostname:
Magic: FreeBSD Kernel Dump
Version String: FreeBSD 7.0-BETA2 #0: Mon Nov 12 11:49:07 BRST 2007
    root@:/usr/src/sys/amd64/compile/MANNY.debug
Panic String: solaris assert: ss == NULL, file:
    /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/space_map.c,
    line: 110
Dump Parity: 2217569595
Bounds: 3
Dump Status: good

After some days of trying things with Pawel (and his contacts among the
Solaris people), we couldn't figure out the problem, so I accepted the
loss and recreated the zpool.

I decided to give it another try, added more memory and did more "tuning",
and for one month everything worked fine, except for slowness when copying
small files to a tape drive (I started a new thread about that on
-performance: http://www.mail-archive.com/freebsd-performance@freebsd.org/msg01764.html),
until I got another crash, this time with:

ZFS(panic): zfs: allocating allocated segment(offset=2781261201408 size=131072)

And again, I couldn't recover my zpool.

I chose ZFS because of the fantastic features available: instant
snapshots, clones, native/transparent compression, and the way you
can create filesystems inside the pool, limiting and reserving space.
All this made my backup solution simply amazing. But these crashes
forced me to step back, and without a filesystem that can handle TBs
without a tedious fsck I had to ...

To: Scott Long <scottl@...>
Cc: <freebsd-current@...>, Ivan Voras <ivoras@...>
Date: Sunday, January 6, 2008 - 5:32 pm

To be clear, in this thread I have been mostly restricting myself to
discussion of kmem problems only, although I have also noted that there
are known ZFS bugs including bugs that are unfixed even in solaris (the
ZIL low memory deadlock is one of them). Indeed, pjd has a long list of
bug reports from me :)

I agree with the rest of this summary.

Kris

To: Kris Kennaway <kris@...>
Cc: <freebsd-current@...>, Ivan Voras <ivoras@...>
Date: Sunday, January 6, 2008 - 5:54 pm

I guess what makes me mad about ZFS is that it's all-or-nothing; either
it works, or it crashes. It doesn't automatically recognize limits and
make adjustments or sacrifices when it reaches those limits, it just
crashes. Wanting multiple gigabytes of RAM for caching in order to
optimize performance is great, but crashing when it doesn't get those
multiple gigabytes of RAM is not so great, and it leaves a bad taste in
my mouth about ZFS in general.
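
The kind of behaviour being asked for here, a cache that gives memory back when it hits a limit instead of failing, can be illustrated with a toy model. The Python sketch below is purely illustrative: BoundedCache and its methods are invented for this example and say nothing about how the ZFS ARC or the FreeBSD kernel allocator actually work.

```python
from collections import OrderedDict

class BoundedCache:
    """Toy LRU cache with a hard byte budget (illustrative only)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.entries = OrderedDict()   # key -> bytes, least recently used first

    def put(self, key, data):
        """Insert data, evicting least recently used entries to stay in budget."""
        if len(data) > self.max_bytes:
            return False               # too big to cache; caller carries on uncached
        if key in self.entries:
            self.used -= len(self.entries.pop(key))
        while self.used + len(data) > self.max_bytes:
            _, evicted = self.entries.popitem(last=False)
            self.used -= len(evicted)
        self.entries[key] = data
        self.used += len(data)
        return True

    def get(self, key):
        """Return cached data or None, refreshing recency on a hit."""
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)
        return self.entries[key]

cache = BoundedCache(max_bytes=10)
cache.put("a", b"12345")
cache.put("b", b"12345")
cache.get("a")                 # "a" is now the most recently used entry
cache.put("c", b"123")         # evicts "b" to make room
print(sorted(cache.entries), cache.used)
```

The point of the sketch is the failure mode: when the budget is reached the cache loses hit rate, but the caller never sees an error.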

Scott

To: Scott Long <scottl@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 6:33 pm

I agree with the sentiment. I don't know about its implementation, but
surely some kind of backout could have been implemented? I'm just
guessing here: maybe the problem is in M_NOWAIT - maybe there could be
a M_NOWAIT_BUT_ALLOW_NULL that would be safe to use in non-sleepable
code but could return NULL, which could be tested and the whole file
system request postponed...

To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 9:16 pm

Um, I don't think this part of the post means what I wanted it to mean -
please ignore it - ETOOTIRED :)

To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 6:32 pm

Scott Long wrote:

To be fair - every fs on the planet had to go through this at one time or another.

We have been perhaps 'spoiled' by the odd runaway log or such that has pushed
UFS to over 103% 'full', struggled on regardless, allowing us to ssh in from
12,000 miles away, kill the offender, clean up the mess, and soldier-on w/o even
a reboot, let alone a crash.

ZFS will (probably) get there one day as well.

But at present, it has become a distraction we don't need.

We're chasing promises...

I'd happily trade all future interest in ZFS for better ufs, nfs, smbfs, ntfs,
xfs, jfs, et al performance/safety/compatibility,

... if only 'coz that's where the bulk of the data we need to 'talk to' actually
resides - not on ZFS or GPFS.

Bill


To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 1:08 pm

My admittedly second-hand understanding is that ZFS shows similarly gratuitous
memory use on both Mac OS X and Solaris. One advantage Solaris has is that it
runs primarily on expensive 64-bit servers with lots of memory. Part of the
problem on FreeBSD is that people run ZFS on systems with 32-bit CPUs and a lot
less memory. It could be that ZFS should be enforcing higher minimum hardware
requirements to mount (i.e., refusing to run on systems with 32-bit address
spaces or <4 GB of memory and inadequate tuning).

Robert N M Watson
Computer Laboratory
University of Cambridge

To: Robert Watson <rwatson@...>
Cc: <freebsd-current@...>, Ivan Voras <ivoras@...>
Date: Tuesday, January 22, 2008 - 11:09 am

Before ZFS was released, I was using it internally on a 32bit
desktop. It never panic'd although it did get very slow after
a while because of the way it managed memory (and probably some
bugs :) while in early alpha/beta.

At work I run it on my Ultra20 desktop with Solaris 10.
It has an AMD64 CPU and I'm pretty sure only 2GB of RAM,
but I'll have to check on the RAM.

Darren


To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 1:28 pm

Solaris nowadays refuses to install on anything without at least 1 GB of
memory. I'm all for ZFS refusing to run on inadequately tuned hardware,
but apparently there's no algorithmic way to say what *is* adequately
tuned, except for "try X and if it crashes, try Y, repeat as necessary".

The reason why I'm arguing this topic is that it isn't a matter of
tuning like "it will run slowly if you don't tune it" - it's more like
"it won't run at all if you don't go through the laborious
trial-and-error process of tuning it, including patching your kernel and
running a non-GENERIC configuration".

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 1:43 pm

What you appear to be still missing is that ZFS also causes memory
exhaustion panics when run on 32-bit Solaris. In fact (unless they have
since fixed it), the OpenSolaris ZFS code makes *absolutely no attempt*
to accommodate i386 memory limitations at all.

Kris

To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 2:00 pm

Citation needed. I'm interested.

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 2:09 pm

Reports on the zfs-discuss mailing list.

Kris

To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 2:49 pm

Thanks for the pointer. I'm looking at the archives.

So far I've found this:
http://www.archivum.info/zfs-discuss@opensolaris.org/2007-07/msg00016.html
which doesn't mention panics;

and this:
http://www.archivum.info/zfs-discuss@opensolaris.org/2007-07/msg00054.html
which didn't get any replies but the backtrace doesn't include anything
resembling a malloc-like call.

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Friday, January 4, 2008 - 12:33 pm

I suppose that depends what you mean by stable. It seems stable enough
for a number of applications today. It's clearly not widely tested
since we haven't shipped a release based on it. It's possible some of
the issues of memory requirements won't be fixable in 7.x, but I don't
think that's a given.

-- Brooks

To: Brooks Davis <brooks@...>
Cc: <freebsd-current@...>
Date: Friday, January 4, 2008 - 1:58 pm

My yardstick is currently "when a month goes by without anyone

This number is not so large. It seems to be easily crashed by rsync,
for example (speaking from my own experience, and also some of my

I listened to some of Pawel's talks and devsummit brainstormings and I
get the feeling *none* of the problems can be fixed in 7.x, especially
on i386. I'm just asking for more official confirmation.

This is not a trivial question, since it involves deploying systems to
be maintained some years into the future - if ZFS will become stable
relatively shortly, it might be worth putting up with crashes, but if
not, there will be no near-future deployments of it.

To: <freebsd-current@...>
Cc: Brooks Davis <brooks@...>, Ivan Voras <ivoras@...>
Date: Sunday, January 6, 2008 - 5:51 am

I can definitely say this is not *generally* true, as I do a lot of
rsyncing/rdiff-backup:ing and similar stuff (with many files / large files)
on ZFS without any stability issues. Problems for me have been limited to
32bit and the memory exhaustion issue rather than "hard" issues.

But perhaps that's all you are referring to.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org

To: Peter Schuller <peter.schuller@...>
Cc: <freebsd-current@...>, Brooks Davis <brooks@...>
Date: Sunday, January 6, 2008 - 8:58 am

It's not generally true since kmem problems with rsync are often hard
to repeat - I have them on one machine, but not on another, similar

Mostly. I did have a ZFS crash with rsync that wasn't kmem related,
but only once.

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>, Peter Schuller <peter.schuller@...>, Brooks Davis <brooks@...>
Date: Sunday, January 6, 2008 - 9:07 am

kmem problems are just tuning. They are not indicative of stability
problems in ZFS. Please report any further non-kmem panics you experience.

Kris


To: Kris Kennaway <kris@...>
Cc: <freebsd-current@...>, Peter Schuller <peter.schuller@...>, Ivan Voras <ivoras@...>, Brooks Davis <brooks@...>
Date: Sunday, January 6, 2008 - 11:48 am

I encountered a deadlock twice during high I/O activity (the last one
during rsync + rm -r on a 5GB hierarchy, openoffice-2/work).

I was running with this patch:
http://people.freebsd.org/~pjd/patches/zgd_done.patch
db> show allpcpu
Current CPU: 1

cpuid = 0
curthread = 0xa5ebe440: pid 3422 "txg_thread_enter"
curpcb = 0xeb175d90
fpcurthread = none
idlethread = 0xa5529aa0: pid 12 "idle: cpu0"
APIC ID = 0
currentldt = 0x50

cpuid = 1
curthread = 0xa56ab220: pid 47 "arc_reclaim_thread"
curpcb = 0xe6837d90
fpcurthread = none
idlethread = 0xa5529880: pid 11 "idle: cpu1"
APIC ID = 1
currentldt = 0x50


To: Henri Hennebert <hlh@...>
Cc: <freebsd-current@...>, Peter Schuller <peter.schuller@...>, Ivan Voras <ivoras@...>, Brooks Davis <brooks@...>
Date: Sunday, January 6, 2008 - 12:03 pm

Backtraces of the affected processes (or just alltrace) are usually
required to proceed with debugging, and lock status is also often vital
(show alllocks, requires witness). Also, in the case when threads are
actually running (not deadlocked), then it is often useful to repeatedly
break/continue and sample many backtraces to try and determine where the
threads are looping.

Kris

To: Kris Kennaway <kris@...>
Cc: <freebsd-current@...>, Peter Schuller <peter.schuller@...>, Ivan Voras <ivoras@...>, Brooks Davis <brooks@...>
Date: Sunday, January 6, 2008 - 12:47 pm

I added it to my kernel config.

I did this after the second deadlock; arc_reclaim_thread was always
there and the second CPU was idle.


To: Henri Hennebert <hlh@...>
Cc: <freebsd-current@...>, Peter Schuller <peter.schuller@...>, Ivan Voras <ivoras@...>, Brooks Davis <brooks@...>
Date: Sunday, January 6, 2008 - 1:13 pm

To repeat, it is important not just to note which thread is running, but
*what the thread is doing*. This means repeatedly comparing the
backtraces, which will allow you to build up a picture of which part of
the code it is looping in.

Kris
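
One way to carry out the comparison described above is to tally how often
each frame shows up across the sampled backtraces. The following is only a
minimal sketch in Python, meant to be run on text copied out of the debugger
rather than inside it. The function names and the frame names in the sample
data are hypothetical placeholders and are not taken from any trace in this
thread.

```python
from collections import Counter


def common_frames(samples):
    """Count in how many sampled backtraces each frame name appears.

    `samples` is a list of backtraces; each backtrace is a list of
    function names, innermost frame first.
    """
    counts = Counter()
    for trace in samples:
        # Count each frame once per sample, even if it recurses.
        counts.update(set(trace))
    return counts


def frames_in_every_sample(samples):
    """Return the frames present in all samples, in first-sample order."""
    if not samples:
        return []
    counts = common_frames(samples)
    # dict.fromkeys() drops repeated frames while keeping their order.
    return [f for f in dict.fromkeys(samples[0]) if counts[f] == len(samples)]


# Hypothetical samples, as if taken by repeatedly breaking and continuing.
samples = [
    ["leaf_a", "loop_body", "worker_main"],
    ["leaf_b", "loop_body", "worker_main"],
    ["leaf_a", "helper", "loop_body", "worker_main"],
]
print(frames_in_every_sample(samples))
```

Frames that appear in every sample mark the part of the call chain the
thread never leaves; the frames that vary below them suggest what the loop
body is doing.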

To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 9:46 am

I agree that ZFS is pretty stable itself. I use a 32bit machine with
2 gigs of RAM and all hang cases are kmem related, but the fact is that
I haven't found any way of tuning to stop it crashing. When I do some
rsyncing, especially between different pools - it hangs or reboots -
mostly on bigger files (i.e. rsyncing ports tree with distfiles).
At the moment I patched the kernel with vm_kern.c.2.patch and it just
stopped crashing, but from time to time the machine looks like it is
frozen for a second or two, after that it works normally.
Have you got any similar experience?
--
regards, Maciej Suszko.

To: Maciej Suszko <maciej@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 11:46 am

That is expected. That patch makes the system do more work to try and
reclaim memory when it would previously have panicked from lack of
memory. However, the same advice applies as to Ivan: you should try and
tune the memory parameters better to avoid this last-ditch situation.

Kris

P.S. It sounds like you do not have sufficient debugging configured
either: crashes should produce either a DDB prompt or a coredump so they
can be studied and understood.

To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 12:05 pm

As Ivan said - tuning kmem_size only delays the moment the system crashes,

You're right - I turned debugging off, because it's not a production
machine and I can afford such behaviour. Right now, using a kernel with
the kmem patch applied, it's "usable".
--
regards, Maciej Suszko.

To: Maciej Suszko <maciej@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 12:22 pm

So the same question applies: exactly what steps did you take to tune
the memory parameters? Extracting this information from you guys
shouldn't be as hard as this :)

Kris


To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 3:56 pm

I was playing around with kmem_max_size mainly. I suppose messing
with KVA_PAGES is not a good idea unless you know exactly how much
memory your software consumes...
--
regards, Maciej Suszko.

To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 9:50 am

I disagree - anything that causes a panic is a stability problem. Panics
persist AFTER the tunings (for i386 certainly, and there are unsolved
reports about it on amd64 also) and are present even when driving kmem
size to the maximum. The tunings *can not solve the problems* currently,
they can only delay the time until they appear, which, by Murphy, often
means "sometime around midnight on Saturday". See also the possibility

I did, once to Pawel and once to the lists. Pawel couldn't help me and
nobody responded on the lists. Can you perform a MySQL read-write
benchmark on one of the 8-core machines with the database on ZFS for about
an hour without pause? On a machine with 2 GB (or less) of RAM,
preferably? I've seen problems on i386 but maybe they are also present
on amd64.

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 10:27 am

That's an assertion directly contradicted by my experience running a
heavily loaded 8-core i386 package builder. Please explain in detail
the steps you have taken to tune your kernel. Do you have the vm_kern.c
patch applied?

> See also the possibility
> of deadlocks in the ZIL, reported by some users.

Yes, this is an outstanding issue. There are a couple of others I run

I am not set up to test this right now.

Kris


To: Kris Kennaway <kris@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 10:51 am

What is the IO profile of this usage? I'd guess that it's "short
bursts of high activity (archive extraction, installing) followed by
long periods of low activity (compiling)". From what I see on the
lists and somewhat from my own experience, the problem appears more
often when the load is more like "constant high r+w activity",
probably with several users (applications) doing the activity in

vm.kmem_size="512M"
vm.kmem_size_max="512M"

I can confirm that while it delays the panics, it doesn't eliminate
them (this also seems to be the conclusion of several users that have
tested it shortly after it's been posted). The fact that it's not
committed is good enough indication that it's not The Answer.

(And besides, asking users to apply non-committed patches just to run
their systems normally is bad practice :) I can just imagine the
Release Notes: "if you're using ZFS, you'll have to manually patch the
kernel with this patch:..." :)

This close to the -RELEASE, I judge the chances of it being committed are low).

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 11:08 am

No, clearly it is not enough (and you claimed previously to have done
more tuning than this). I have it set to 600MB on the i386 system with

ZFS already tells you up front that it's experimental code and likely to
have problems. Users of 7.0-RELEASE should not have unrealistic

Perhaps, but that only applies to 7.0-RELEASE.

Kris


To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 12:45 pm

This looks like we're constantly chasing the "right amount". Does it
depend so much on CPU and IO speed that there's never a generally
sufficient "right amount"? So when CPU and drive speed increase, the new

Where? What else is there except kmem tuning (including KVA_PAGES)? IIRC

My point is that the fact that such things are necessary (1.5 GB KVA is
a lot on i386) means that there are serious problems which haven't been
fixed since ZFS was imported (that's over 6 months ago).

I see you've added to http://wiki.freebsd.org/ZFSTuningGuide; can you
please add the values that work for you to it (especially for KVA_PAGES
since the exact kernel configuration line is never spelled out in the

I know it's experimental, but requiring users to perform so much tuning
just to get it to work without crashing will mean it will get a bad
reputation early on. Do you (or anyone) know what the reasons are for
not setting vm.kmem_size to 512 MB by default? Better yet, why not
increase both vm.kmem_size and KVA_PAGES to (the equivalent of) 640 MB
or 768 MB by default for 7.0?

> Users of 7.0-RELEASE should not have unrealistic
> expectations.

As I said in the first post of this thread: I'm interested in whether it's
ever going to be stable for 7.x.
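
For reference, the arithmetic behind KVA_PAGES can be sketched as follows.
This assumes the usual i386 convention that one KVA page covers 4 MB without
PAE (2 MB with PAE), with a non-PAE default of 256 pages, i.e. 1 GB of kernel
address space. The helper name is made up for illustration, and the resulting
value would go into a custom kernel configuration as "options KVA_PAGES=N".

```python
MB = 1024 * 1024


def kva_pages_for(kva_bytes, pae=False):
    """Return the KVA_PAGES value covering at least kva_bytes of kernel VA."""
    page_span = (2 if pae else 4) * MB   # assumed span of one KVA page
    return -(-kva_bytes // page_span)    # ceiling division


for size_mb in (1024, 1536, 2048):
    pages = kva_pages_for(size_mb * MB)
    print(f"{size_mb} MB KVA -> options KVA_PAGES={pages}")
```

Under these assumptions the 1.5 GB figure mentioned above corresponds to
KVA_PAGES=384. Since kmem is carved out of the kernel address space, a
vm.kmem_size close to the default 1 GB of KVA is what makes raising
KVA_PAGES necessary in the first place.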

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 1:12 pm

It depends on your workload, which in turn depends on your hardware.

Tuning is an interactive process. If 512MB is not enough kmem_map, then

ZFS is a memory hog. There is nothing that can really be done about
this, and it is just not a good fit on i386 because of limitations of
the hardware architecture. Note that Sun does not recommend using ZFS
on a 32-bit system either, for the same reasons. It is unlikely this
can really be fixed, although mitigation strategies like the vm_kern.c

Increasing vm.kmem_size_max to 512MB by default has other implications,
but it is something that should be considered.

That is answered in the tuning guide. Tuning KVA_PAGES by default is

This was in reply to a comment you made about the vm_kern.c patch
affecting users of 7.0-RELEASE.

Anyway, to sum up, ZFS has known bugs, some of which are unresolved by
the authors, and it is difficult to make it work on i386. It is likely
that the bugs will be fixed over time (obviously), but amd64 will always
be a better choice than i386 for using ZFS because you will not be
continually bumping up against the hardware limitations.

Kris

To: Kris Kennaway <kris@...>
Cc: <freebsd-current@...>, Ivan Voras <ivoras@...>
Date: Sunday, January 6, 2008 - 2:48 pm

As a user, I would expect the above to mean "to continue running quickly".
If it has to slow to a crawl for a moment, due to inadequate memory in
your system, then that's just tough cookies. But crashing (panicking)
is not really acceptable for most people (maybe except a developer).
Again from a user perspective, if ZFS needs "tuning" to run at full speed,
or even at all, I would expect *it* to be able to do a few simple calculations
and do the tuning itself! :-) (even if, in worst case, it requires a clean
shutdown and reboot for the new values to take effect)

The above is not meant as a criticism of the current explicitly-labeled
"experimental" code. Rather, it is what I would hope we might be able

Perhaps the 7.0 release notes should include a note to the effect that
ZFS is *strongly* NOT RECOMMENDED on 32-bit systems at this time, due
to the likelihood of panics. I say this because it sure sounds like
"out of the box" that is what you're most likely to end up with, and
even with manual "corrections" you may still have panics. So why not
just be upfront about it and tell people that, at least at this time,
ZFS is only recommended for 64-bit systems, with a minimum of "N" (2?)
GB of memory? If you were already planning something like this for

BTW, I am a happy user of ZFS on a 2GB Core2Duo 64-bit system. I never
did any "tuning", it "just worked" for my light-duty file serving needs.
This was from the (I believe) May 2007 snapshot.

Gary

To: Gary Corcoran <gcorcoran@...>
Cc: Kris Kennaway <kris@...>, <freebsd-current@...>, Ivan Voras <ivoras@...>
Date: Sunday, January 6, 2008 - 5:56 pm

By watching this and other threads on ZFS and reading Sun's own design
papers I am getting the strong impression that this should be even
stronger than a strong NOT RECOMMENDED. Perhaps ZFS should BE DISALLOWED to
run on i386 at all (unless one does some manual source code tweaking or
something like this, and hence can ask no official support from the
project).

I believe that 95% of hardware today that realistically is capable of
running ZFS is also capable of running 64bit code, so that potential ZFS
users are far better off switching to FreeBSD/amd64 and helping with
testing/improving that architecture than fighting architectural
limitations of the already dying i386. And we as a project are better
off too, by spending our limited resources on something that has a future.

From my own experience FreeBSD/amd64 is quite mature for running most
if not all of the server tasks today and ZFS is first and foremost a
server FS. The only place where FreeBSD/i386 beats FreeBSD/amd64 is
desktop, due to binary drivers and such, but ZFS is almost useless
there. So that by simply officially disallowing ZFS on FreeBSD/i386 we
could win by a great margin.

Just my CAD0.02.

-Maxim

To: Maxim Sobolev <sobomax@...>
Cc: Gary Corcoran <gcorcoran@...>, Ivan Voras <ivoras@...>, <freebsd-current@...>
Date: Monday, January 7, 2008 - 10:52 am

As someone who's been running ZFS happily ever since pjd committed it
to CURRENT early 2007 on i386 with 1GB of RAM I would definitely say
NO!
Put up warnings, banners and whatever you want but disabling it just
because some users had some panics or just haven't taken the time to
tune their system (I'm all in favor of auto-tuning here) just doesn't
seem reason enough for me to limit other people's choices.

I've listed it before but again for the record:
i386 Xeon, 1GB RAM
4x320GB RAIDZ with root on zfs
zil enabled, prefetching disabled to improve video play
Shared via NFS and Samba

cat /boot/loader.conf
zfs_load="YES"
vfs.root.mountfrom="zfs:r4x320"
vfs.zfs.prefetch_disable=1

That's it in my loader.conf and for months now I haven't seen a panic.
Why should I or anyone else happily running ZFS on i386 be denied
doing so?

--
Joao Barros

To: <freebsd-current@...>
Cc: Joao Barros <joao.barros@...>, Gary Corcoran <gcorcoran@...>, Ivan Voras <ivoras@...>
Date: Monday, January 7, 2008 - 1:06 pm

100% agreement.

If you want to go to extremes, require the user to put
zfs.zfs.run_on_32_bit_and_i_understand_i_am_an_idiot_and_this_is_not_recommended=1
in loader.conf, or else have the kernel panic by design on boot.

But don't make it totally impossible without patching the source, *please*.

Obviously the exception is if development for i386 stops such that it actually
does not work. But disallowing it for artificial reasons... please leave
things like that to proprietary hardware/software vendors trying to squeeze
money out of consumers, and leave it out of a free operating system.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org

To: <freebsd-current@...>
Date: Monday, January 7, 2008 - 6:47 pm

On Mon, 7 Jan 2008 18:06:55 +0100

Just a note to put another '100% agreement' sign up. We have plenty of
other FSes which are half-cooked and can easily hurt people, but nobody
suggests removing them. Why should ZFS be singled out? It is in way
better shape than most of them.

-- 
Alexander Kabaev

To: Maxim Sobolev <sobomax@...>
Cc: <freebsd-current@...>
Date: Monday, January 7, 2008 - 3:10 am

[Empty message]
To: Christian Walther <cptsalek@...>
Cc: <freebsd-current@...>
Date: Monday, January 7, 2008 - 4:20 am

That could be a good thing (think programs creating lots of files and

How is that different to creating one / slice of FFS?

Igor :-)

To: Igor Mozolevsky <igor@...>
Cc: <freebsd-current@...>
Date: Monday, January 7, 2008 - 4:51 am

Hello Igor,

With ZFS there aren't fixed boundaries as there are with the
slice/partition theme. You can use reservation and quota to determine
how much free space is guaranteed for a ZFS and the maximum size a ZFS
is allowed to grow to.
If you feel that these boundaries/limitations aren't of any use

Christian

To: Maxim Sobolev <sobomax@...>
Cc: <freebsd-current@...>, Kris Kennaway <kris@...>, Gary Corcoran <gcorcoran@...>, Ivan Voras <ivoras@...>
Date: Sunday, January 6, 2008 - 7:39 pm

Hi,

All new hardware since Intel started supporting 64 bits on their

Let's look at it more practically. Are all features and all ports
supported on all platforms all the time?

I do not think so.

So, just make it a requirement for ZFS to run only on 64 bit upward.

It is not as if FreeBSD lacks a file system for older
machines.

Erich

To: <freebsd-current@...>
Date: Tuesday, January 8, 2008 - 1:51 pm

Erich Dollansky wrote:
> Maxim Sobolev wrote:
> > Gary Corcoran wrote:
> >
> > I believe that 95% of hardware today that realistically is capable of
>
> I do not think so.

Actually I think it's less than 95%.

Of the seven machines I have at home, only one is 64bit
capable -- and that one happens to be a DEC-Alpha which
doesn't support ZFS.

Of the machines at our office room (dunno the count,
must be about a dozen) only one is amd64 capable --
and that one happens to be a workstation that needs
to run 32bit i386 because of X11 graphics support
(and I don't really need to use ZFS on it).

> > running ZFS is also capable of running 64bit code, so that potential ZFS
>
> All new hardware since Intel started supporting 64 bits on their
> Pentiums is.

Nope. There's still hardware produced today that's not
64bit-capable.

FWIW, my NFS server at home is an EPIA PD-10k board with
a VIA C3 processor (32bit only). I chose that one because
of the very low power consumption. It works perfectly
well for my purposes.

> So, just make it a requirement for ZFS to run only on 64 bit upward.

I would certainly vote against such nonsense.

However, I think it does make sense to print a warning
if an admin tries to use ZFS on an i386 machine. It
wouldn't hurt anyway.

It's quite normal that running certain software requires
some tuning so that software will work at all. Typical
examples are squid (uses a lot of sysv message queues)
and PostgreSQL (semaphores) -- they won't run without
tuning, except for trivial setups that don't really do
much. The ZFS tuning issues aren't much different.

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Ge...

To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 7:32 pm

On Sun, Jan 06, 2008 at 01:56:59PM -0800, Maxim Sobolev wrote:

By watching this and other threads on ZFS and reading Sun's own design
papers I am getting the strong impression that this should be even stronger
than a strong NOT RECOMMENDED. Perhaps ZFS should BE DISALLOWED to run on
i386 at all (unless one does some manual source code tweaking or something
like this, and hence can ask no official support from the project).

I feel like stating my opinion on this, noting that I am usually stubborn enough to
squeeze a lot of value out of any rough product, while avoiding complaining about
continuing problems if I am not prepared to put in appropriate effort to solve them.

A summary of my opinion on this matter is that some i386 FreeBSD servers do have a
place running zfs in a useful role, but some dedication and patience from the
administrator is usually required, and the effort to tune at least kmem is nearly
required on ALL hardware platforms, not just i386. I think kmem shortages from zfs
are simply more touchy on i386, and with enough ram and slightly more tuning than
amd64 the kmem can most likely be tuned away, but this does not do anything for
other zfs problems such as zil deadlocks and other deadlocks. I think doing
something to prevent FreeBSD/i386 users from using zfs will just rule out a portion
of the people having problems, and admins who take a little time to tune zfs AND use
it more than just lightly may continue to have problems, and will just come back to
the lists.

I have zfs on at least 4 systems presently, each one tuned to where I no longer
receive kmem panics at least based on their expected system load. 2 of them
are i386 and I would be quite dismayed to upgrade RELENG_7 to find ZFS has
been disabled for me (although since I read the mailing lists I would expect
it and deal appropriately). It would be a tradeoff between breaking a limited
amount of existing setups versus somehow limiting the influx of new zfs users
who _may_ en...

To: Adam McDougall <mcdouga9@...>
Cc: <freebsd-current@...>
Date: Tuesday, January 8, 2008 - 3:56 am

Note that you're probably running into an integer overflow in arc.c
if vm.kmem_size is set to 1GB or higher on i386. As a result
kstat.zfs.misc.arcstats.size won't grow above
kstat.zfs.misc.arcstats.c_min...

I posted to freebsd-fs about it, but haven't heard anything from pjd yet.

Sadly, after fixing that problem I started encountering the kmem_map
too small panics.

--
David Taylor

To: Adam McDougall <mcdouga9@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 8:58 pm

In Russian we have a good saying: "You can teach a bear to ride a
bicycle, but will it ever enjoy it?"

The same applies here - seemingly due to the ZFS design limitations and
limitations of the FreeBSD kernel you can't get ZFS to run reliably out
of the box on i386. Yes, you can probably do some tweaks here and there,
to make it more or less stable given the workload, but that's not what
most of the FreeBSD users expect from the file system. Unlike you, most
administrators won't even bother to read tweaking documentation
explaining why ZFS is so tricky on i386, let alone doing actual
trial-and-error to determine the right set of tunables. More likely at
the first incident they would just dismiss FreeBSD/ZFS as crap.

-Maxim

To: Maxim Sobolev <sobomax@...>
Cc: Adam McDougall <mcdouga9@...>, <freebsd-current@...>
Date: Monday, January 7, 2008 - 11:36 am

I enjoy my i386/ZFS servers.
One is running with just 384MB RAM, and the only instability it currently
has is because the / disk is dying.
But even with this it has 71 days uptime.
And my backup server is also based on ZFS with just 196MB RAM.
This one isn't running stable, but it is stable enough for just doing the zfs
imports and restoring some files.
The only 64-bit machines I have at home are alpha and sparc64, so no
option for ZFS right now.

I don't see a real difference between running a 2G i386 and a 2G amd64
server.
8GB amd64 boxes are still not very common.
Of course the i386 must be tuned to have enough kmem and KVA and of
course doing so reduces the application space, but it is still within
the hardware limitations and an NFS-Server doesn't need much application
address space anyway.
Considered that we seem to have a limitation on running amd64 with more

This is true however.
I'm OK with a big fat warning on i386 and/or a loader env to be set
to reenable this for persons who know (or think they do) what they are
doing.
If we say ZFS on i386 isn't supported, then at least it shouldn't be
possible to configure it without someone hitting a special knob.

But in my opinion ZFS shouldn't overflow kmem storage in the first place.
This is not only an i386 problem, as raising kmem makes ZFS more hungry
by default, which is only good if kmem isn't used for something else.
I have an amd64 system with 4G RAM and kmem defaults to 419430400 Bytes.
On my 384MB i386 box I have (tuned of course) 335544320 Bytes kmem.
And with more RAM I could easily go over the default on the amd64 box.
Well RAM is too expensive since it is SDRAM, but there are many DDR
boxes without amd64 functionality out there, which allows adding
affordable memory in the nGB range.
I had to reduce ARC sizes to get the 384MB box stable - the OS version
is quite old and things have been modified for that in the meantime,
but considering that 2G i386 systems still panic with more kmem than
I have RAM it simply says that it wouldn't run ...

To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 1:20 pm

Ok. I'd like to understand what is the relationship between KVA_PAGES
and vm.kmem_size. The tuning guide says:

"""By default the kernel receives 1GB of the 4GB of address space
available on the i386 architecture, and this is used for all of the
kernel address space needs, not just the kmem map. By increasing
KVA_PAGES you can allocate a larger proportion of the 4GB address
space..."""

and:

"""recompile your kernel with increased KVA_PAGES option, to increase
the size of the kernel address space, before vm.kmem_size can be
increased beyond 512M"""

What is the other 512 MB of the 1 GB used for?
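
As a rough worked example of the arithmetic behind the quoted guide text: the sketch below assumes a non-PAE i386 kernel in which each KVA_PAGES unit covers 4 MB of address space and the default value is 256. Neither number is stated in this thread, and `kernel_address_space` is a made-up helper name, so treat this as an illustration only.

```python
# Assumption: on non-PAE i386 each KVA_PAGES unit covers 4 MiB of
# kernel virtual address space, and the default value is 256.
MIB = 1024 * 1024
KVA_UNIT = 4 * MIB
TOTAL_ADDRESS_SPACE = 4096 * MIB  # 4 GiB on a 32-bit machine


def kernel_address_space(kva_pages):
    """Return the kernel virtual address space size in bytes."""
    return kva_pages * KVA_UNIT


for kva_pages in (256, 384, 512):
    kernel = kernel_address_space(kva_pages)
    user = TOTAL_ADDRESS_SPACE - kernel
    print(f"KVA_PAGES={kva_pages}: kernel {kernel // MIB} MiB, "
          f"user {user // MIB} MiB")
```

Under these assumptions the default gives the 1GB the guide mentions, and every megabyte added to the kernel side is taken away from the address space available to each process.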

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 1:34 pm

Everything else that the kernel needs address space for. Buffer cache,
mbuf allocation, etc.

Kris

To: Kris Kennaway <kris@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 4:56 pm

Aren't they allocated from the same memory zones? I have a router with 256
MB RAM, it had a panic with ng_nat once due to exhausted kmem. So, what
do these numbers from its sysctl really mean?

vm.kmem_size: 83415040
vm.kmem_size_max: 335544320
vm.kmem_size_scale: 3
vm.kvm_size: 1073737728
vm.kvm_free: 704638976

--
WBR, Vadim Goncharov

To: Vadim Goncharov <vadimnuclight@...>
Cc: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 5:42 pm

I believe that mbufs are allocated from a separate map. In your case
you only have ~80MB available in your kmem_map, which is used for
malloc() in the kernel. It is possible that ng_nat in combination with
the other kernel malloc usage exhausted this relatively small amount of
space without mbuf use being a factor.
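
For reference, the byte counts quoted above can be converted with a few lines of Python. This is only a convenience sketch: the dictionary repeats the sysctl values from the earlier message, and `to_mib` is a made-up helper name.

```python
# Values copied from the sysctl output quoted earlier in the thread.
sysctls = {
    "vm.kmem_size": 83415040,
    "vm.kmem_size_max": 335544320,
    "vm.kvm_size": 1073737728,
    "vm.kvm_free": 704638976,
}
MIB = 1024 * 1024


def to_mib(nbytes):
    """Convert a byte count to mebibytes."""
    return nbytes / MIB


for name, value in sysctls.items():
    print(f"{name}: {to_mib(value):.2f} MiB")

# Kernel address space currently in use is the difference of the two kvm values.
kvm_used = sysctls["vm.kvm_size"] - sysctls["vm.kvm_free"]
print(f"kernel address space in use: {to_mib(kvm_used):.2f} MiB")
```

The kmem_size value works out to a little under 80 MiB, which is where the "~80MB" figure comes from.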

Kris

To: Kris Kennaway <kris@...>
Cc: <freebsd-current@...>, Vadim Goncharov <vadimnuclight@...>
Date: Sunday, January 6, 2008 - 6:33 pm

Actually, with mbuma, this has changed -- mbufs are now allocated from the
general kernel map. Pipe buffer memory and a few other things are still
allocated from separate maps, however. In fact, this was one of the known
issues with the introduction of large cluster sizes without resource limits:
address space and memory use were potentially unbounded, so Randall recently
properly implemented the resource limits on mbuf clusters of large sizes.

Robert N M Watson
Computer Laboratory
University of Cambridge

To: Robert Watson <rwatson@...>, Kris Kennaway <kris@...>
Cc: <freebsd-current@...>
Date: Monday, January 7, 2008 - 11:16 am

Yes, in-kernel libalias is "leaking" in the sense that it grows unbounded, and
uses malloc(9) instead of its own UMA zone with settable limits (it frees
all used memory, however, on shutting down ng_nat, so I've done a
workaround restarting ng_nat nodes once a month). But as I see the panic
string:

panic: kmem_malloc(16384): kmem_map too small: 83415040 total allocated

and memory usage in crash dump:

router:~# vmstat -m -M /var/crash/vmcore.32 | grep alias
libalias 241127 30161K - 460568995 128
router:~# vmstat -m -M /var/crash/vmcore.32 | awk '{sum+=$3} END {print sum}'
50407

...so why were only 50 MB out of 80 in use at the moment of panic?
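
The awk one-liner adds up the third whitespace-separated field, which is the MemUse column only when the type name contains no spaces. A minimal sketch of a more tolerant sum in Python follows, assuming the `vmstat -m` layout is `Type InUse MemUse HighUse Requests Size(s)` and that MemUse is printed as a number followed by `K`. The second sample line is hypothetical, and `total_memuse_kib` is a made-up name.

```python
import re

# MemUse is printed as a number followed by "K", e.g. "30161K".
MEMUSE = re.compile(r"^(\d+)K$")


def total_memuse_kib(lines):
    """Sum the MemUse column of `vmstat -m` style lines, in KiB."""
    total = 0
    for line in lines:
        for token in line.split():
            match = MEMUSE.match(token)
            if match:
                total += int(match.group(1))
                break  # only the first "<n>K" token on a line is MemUse
    return total


sample = [
    "libalias 241127 30161K - 460568995 128",  # from the crash dump above
    "CAM dev queue 4 1K - 4 128",              # hypothetical, name has spaces
]
print(total_memuse_kib(sample), "KiB")
```

Either way, the sum covers malloc(9) types only, so it would not be expected to account for everything living in kmem_map.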

BTW, current memory usage (April 6.2S, ipf w+ 2 ng_nat's) a week after
restart is low:

vadim@router:~>vmstat -m | grep alias
libalias 79542 9983K - 179493840 128
vadim@router:~>vmstat -m | awk '{sum+=$3} END {print sum}'

I still don't understand what those numbers from sysctl above exactly
mean - sysctl -d for them is obscure. How much memory does the kernel use
in RAM, and for which purposes? Is that limit constant? Does the kernel swap
out parts of it, and if yes, how much?

--
WBR, Vadim Goncharov

To: Vadim Goncharov <vadim_nuclight@...>
Cc: <freebsd-current@...>
Date: Monday, January 7, 2008 - 11:39 am

Did you have any luck raising interest from Paulo regarding this problem? Is
there a PR I can take a look at? I'm not really familiar with the code, so
I'd prefer someone who was a bit more familiar with it looked after it, but I

This is a bit complicated to answer, but I'll try to capture the gist in a
short space.

The kernel memory map is an address space in which pages can be placed to be
used by the kernel. Those pages are often allocated using one of two kernel
allocators, malloc(9) which does variable sized memory allocations, and uma(9)
which is a slab allocator and supports caching of complex but fixed-size
objects. Temporary buffers of variable size or infrequently allocated objects
will use malloc, but frequently allocated objects of fixed size (vnodes, mbufs,
...) will use uma. "vmstat -m" prints out information on malloc allocations,
and "vmstat -z" prints out information on uma allocations.

To make life slightly more complicated, small malloc allocations are actually
implemented using uma -- there are a small number of small object size zones
reserved for this purpose, and malloc just rounds up to the next such bucket
size and allocates from that bucket. For larger sizes, malloc goes through
uma, but pretty much directly to VM which makes pages available directly. So
when you look at "vmstat -z" output, be aware that some of the zones
presented there (those named things like "128", "256", etc) are actually the
pools from which malloc allocations come, so there's double-counting.
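
The rounding behaviour described above can be sketched as follows. The bucket list and the 4096-byte page size are illustrative assumptions, not the kernel's actual zone set, and `allocation_size` is a made-up name.

```python
PAGE_SIZE = 4096
# Hypothetical bucket sizes; a real kernel chooses its own set.
BUCKETS = [16, 32, 64, 128, 256, 512, 1024, 2048, 4096]


def allocation_size(request):
    """Return the number of bytes actually consumed by a request."""
    for bucket in BUCKETS:
        if request <= bucket:
            return bucket  # small request: served from a fixed-size zone
    # Larger requests bypass the buckets and are rounded up to whole pages.
    pages = (request + PAGE_SIZE - 1) // PAGE_SIZE
    return pages * PAGE_SIZE


for request in (100, 128, 129, 5000, 16384):
    print(request, "->", allocation_size(request))
```

In this model a 129-byte request costs as much as a 256-byte one, which is one reason per-type byte counts and per-zone counts do not add up neatly.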

There are also other ways to get memory into the kernel map, such as directly
inserting pages from user memory into the kernel address space in order to
implement zero-copy. This is done, for example, when zero-copy sockets are
used.

To make life just very slightly more complicated still, I'll tell you that
there is something called "submaps" in the kernel memory map, which have
special properties. One of these is used for mapping the buffer cache...

To: Robert Watson <rwatson@...>, Paolo Pisati <piso@...>
Cc: <freebsd-current@...>
Date: Monday, January 7, 2008 - 7:28 pm

No, I didn't do that yet. A brief search, however, shows kern/118432, though
it is not directly a kmem issue, and also a thread
http://209.85.135.104/search?q=cache:lpXLlrtojg8J:archive.netbsd.se/%3Fm...
in which a memory exhaustion problem was predicted. Also, I've heard some
rumors about ng_nat memory panics under very heavy load, but a man with
a 300Mbps router with several ng_nat's said his router is rock stable for
half a year - though his router has 1 GB of RAM and mine only 256 MB (BTW,
it's his system that has crashed recently with kern/118993, but this is

Yes, I knew that, but didn't know what the column names exactly mean.
Requests/Failures, I guess, are pure statistics, Size is one element

Last time I tried it on 5.4 it caused panics every several hours on my

So, is the kernel memory map a global thing that covers the entire kernel, or
are there several maps in the kernel, say, one for malloc(), one for other
UMA, etc.? Recalling sysctl values from my previous message:

vm.kmem_size: 83415040
vm.kmem_size_max: 335544320
vm.kmem_size_scale: 3
vm.kvm_size: 1073737728
vm.kvm_free: 704638976

So, kvm_size looks like the amount of KVA_PAGES, covering the entire kernel
address space, plugged into every process' address space. But more than 300
megs are used, while the machine has only 256 MB of RAM. I see this line in top:

Mem: 41M Active, 1268K Inact, 102M Wired, 34M Buf, 94M Free

I guess the 34M buffer cache is entirely in-kernel memory; is this part of
kmem_size or another part of kernel space? What do kmem_size_max and
kmem_size_scale do - can kmem grow dynamically? Does a kmem_size of about 80
megs mean that 80 megs of RAM is constantly used by the kernel for its needs,
including buffer cache, and the other 176 megs are spent for processes' RSS, or

We can assume for simplicity that their memory is not-so-kernel but part of

Umm. I think there is no point in swapp...

To: Vadim Goncharov <vadim_nuclight@...>
Cc: <freebsd-current@...>, Paolo Pisati <piso@...>
Date: Monday, January 7, 2008 - 7:39 pm

Possibly we should rename the "FREE" column to "CACHE" -- the free count is
the number of items in the UMA cache. These may be hung in buckets off the
per-CPU cache, or be spare buckets in the zone. Either way, the memory has to
be reclaimed before it can be used for other purposes, and generally for
complex objects, it can be allocated much more quickly than going back to VM
for more memory. LIMIT is an administrative limit that may be configured on
the zone, and is configured for some but not all zones.
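
The difference between the cached ("FREE") count and the LIMIT can be illustrated with a toy model. This is a sketch of the idea only: the `Zone` class and its method names are hypothetical and do not correspond to the uma(9) interface.

```python
class Zone:
    """Toy model of a zone with an item cache and an optional limit."""

    def __init__(self, name, limit=None):
        self.name = name
        self.limit = limit  # administrative cap, None means unlimited
        self.used = 0       # items handed out to callers
        self.free = 0       # items cached in the zone, ready for reuse

    def alloc(self):
        if self.free > 0:
            self.free -= 1  # fast path: reuse a cached item
        elif self.limit is not None and self.used >= self.limit:
            return False    # limit reached and nothing cached
        self.used += 1
        return True

    def release(self):
        self.used -= 1
        self.free += 1      # item stays cached; memory is not returned yet

    def drain(self):
        """Hand cached items back to the VM; return how many were reclaimed."""
        reclaimed, self.free = self.free, 0
        return reclaimed


zone = Zone("example", limit=2)
zone.alloc()
zone.alloc()
print(zone.alloc())          # limit reached
zone.release()
print(zone.used, zone.free)  # one item in use, one sitting in the cache
print(zone.drain())          # cached item reclaimed
```

In this model, memory held in the cache is unavailable to anyone else until `drain` runs, which is the point made above about reclaiming it before it can be used for other purposes.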

I'll let someone with a bit more VM experience follow up with more information

If it is mapped into the kernel address space, then it still counts towards
the limit on the map. There are really two critical resources: memory itself,
and address space to map it into. Over time, the balance between address
space and memory changes -- for a long time, 32 bits was the 640k of the UNIX
world, so there was always plenty of address space and not enough memory to
fill it. More recently, physical memory started to overtake address space,
and now with the advent of widely available 64-bit systems, it's swinging in
the other direction. The trick is always in how to tune things, as tuning
parameters designed for "memory is bounded and address space is infinite"
often work less well when that's not the case. In the early 5.x series, we
had a lot of kernel panics because kernel constants were scaling to physical
memory rather than address space, so the kernel would run out of address

Yes, that's what I meant. There are some other types of pageable kernel
memory, such as memory used for swap-backed md devices.

Robert N M Watson
Computer Laboratory
University of Cambridge

To: Robert Watson <rwatson@...>
Cc: <freebsd-current@...>
Date: Tuesday, January 8, 2008 - 2:58 pm

And every unlimited zone after growing on demand can cause
kmem_map/kmem_size panics, or some will cause low-memory panics with message

That would be good, as I still don't have any idea about the exact meaning of those

Hmm, I do remember messages about panics with malloc-backed md devices (with
workaround advice to switch to swap-backed md), yes...

--
WBR, Vadim Goncharov

To: Vadim Goncharov <vadim_nuclight@...>
Cc: <freebsd-current@...>
Date: Tuesday, January 8, 2008 - 3:22 pm

Well, there are also limits not imposed using the UMA limit mechanism, so just
because it appears unbounded in vmstat -z doesn't mean there's no limit.
There's no UMA zone limit on processes, but there's a separately imposed
maxproc limit--and as a result, filedesc, which is typically one per process,
is also bounded to approximately maxproc. Likewise, many other data
structures effectively scale with the number of processes, the size of
physical memory, the size of the address space, maxusers, etc.

There are relatively few things that actually have no limit associated with
them one way or another, precisely because if there's no limit it can lead the
kernel to become starved of resources. Where there isn't a limit, ideally
privilege is required to allocate (i.e., malloc-backed swap requires root
privilege to configure). Sometimes the limits are much more complex than a
single global limit, such as resources controlled using resource limits, which
can be per-process, per-uid, etc.

Robert N M Watson
Computer Laboratory
University of Cambridge

To: <freebsd-current@...>
Date: Sunday, January 6, 2008 - 6:45 pm

Is this related to reported panics with ZFS and a heavy network load
(NFS mostly)?

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Tuesday, January 8, 2008 - 5:19 am

Handling resource exhaustion is a tricky issue, because sometimes it takes
resources to make resources available. In the presence of a really greedy
(that is to say, effectively leaking) subsystem, there isn't really any way to
recover. There are really two alternatives: deadlock (no resources are
available, so no progress can be made) or panic (no resources are available so
do the only thing we can). Subsystems are relied upon to impose their own
limits, or at least provide those limits to UMA so that UMA can impose them,
as "appropriate" limits are entirely dependent on context. It's indeed the
case that the more load the system is under, the more resources are in use,
and therefore the lower the threshold for any particular system to contribute
to a potential exhaustion of resources. If the network is at a very high
watermark, then indeed ZFS has to use less to exhaust it.

Normally, subsystems like the network stack and file systems rely on "back
pressure" to cause them to release memory -- the network stack largely
allocates using UMA, so the VM low memory event frees up its caches, and it
also implements its own per-protocol low memory handlers, doing things like
discarding TCP reassembly buffers, etc. VM also knows to discard un-dirtied
pages. Pawel has a patch to make ZFS more aggressively call low memory event
handlers when it gets a bit too greedy, which I saw in the re@ MFC queue
yesterday; you might find this improves behavior a bit more. However,
things do get quite tricky when you're low on resources, because waiting
indefinitely for resources rather than panicking may actually be worse,
because the system may never recover. That's why constraining initial resource
use and responding to back pressure early is critical, in order to avoid getting
into situations where the only possible response is to hang or panic.
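
As a toy illustration of back pressure: the model below asks reclaimable caches to shrink before an allocation is allowed to fail. All class and method names are hypothetical and none of them correspond to FreeBSD kernel interfaces; halving on each event is an arbitrary choice made for the sketch.

```python
class Cache:
    """A reclaimable consumer, such as a file system cache."""

    def __init__(self, name, size):
        self.name = name
        self.size = size

    def shrink(self):
        released = self.size // 2  # give back half on each event
        self.size -= released
        return released


class Pool:
    """Fixed-capacity pool that applies back pressure before failing."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pinned = 0   # memory that cannot be reclaimed
        self.caches = []  # consumers that respond to low memory events

    def used(self):
        return self.pinned + sum(cache.size for cache in self.caches)

    def low_memory_event(self):
        return sum(cache.shrink() for cache in self.caches)

    def allocate(self, nbytes):
        if self.used() + nbytes > self.capacity:
            self.low_memory_event()  # ask the caches to release memory first
        if self.used() + nbytes > self.capacity:
            raise MemoryError("exhausted even after reclaiming caches")
        self.pinned += nbytes


pool = Pool(capacity=100)
pool.caches.append(Cache("arc", 80))  # hypothetical cache name and size
pool.allocate(50)                     # succeeds only because the cache shrank
print(pool.used(), pool.caches[0].size)
```

A second large allocation in this model fails even after another shrink, which mirrors the point above: once most memory is pinned, back pressure arrives too late to help.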

There's an interesting paper by Gibson, et al, from CMU on economic models for
"investing" memory pages in different sorts...

To: Ivan Voras <ivoras@...>
Cc: <freebsd-current@...>
Date: Friday, January 4, 2008 - 2:12 pm

I saw those crashes early on, but that's 90% of what the mirror server
I'm running does and I'm not seeing them any more. I won't argue

My understanding is that ZFS will never be a great choice on any 32-bit
architecture without major changes Sun probably isn't interested in
making. I think many of the problems people are reporting stem from

I don't think anyone is naive enough to say everything will be perfect
by any given date. Reality doesn't work that way. People looking to
deploy ZFS now will need to tolerate a certain amount of risk since it's
never been part of a FreeBSD release (and it's still quite new even in
Solaris). Issues being unfixable in 7.x are one of those risks, but
that's always the case.

-- Brooks
