Re: kern/134011

From: Randy Bush
Subject: kern/134011
Date: Monday, May 25, 2009 - 5:25 pm

[ yes, dear, i know i should not run current on production systems.
  but then, if no one does, how are we gonna shake out problems under
  load and real life conditions?  someone has to do it. ]

this bug is now causing system lockup when the midnight gmt jobs run on
one system and is manifesting with less serious consequences on three
others.

the servers are all racked and remote but have serial console access.
how can i be of help finding this one?

randy
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"
From: Randy Bush
Date: Monday, May 25, 2009 - 5:29 pm

oh, and other folk have reported on the list seeing the same, though
they have not added to the pr.

randy
From: Scott Long
Date: Monday, May 25, 2009 - 6:11 pm

If you're using ZFS then you want to get to the tip of current to pick 
up the VM backpressure fixes that were added.

Scott
From: Randy Bush
Date: Tuesday, May 26, 2009 - 9:11 pm

i cvsupped and built and installed new kernel and world.  cvsup of May
26 00:36

it locks up solid very reliably

i do not think this is a related bug.

randy
From: Larry Rosenman
Date: Monday, May 25, 2009 - 6:49 pm

I'm still seeing occasional (I.E. I haven't pinned it down) ZFS write
crashes.  (this is with a current of 23-May-2009.)

See my posts earlier today.  I expect to get a textdump with the UMA and
malloc stats that Kip requested in the next 24-72 hours if it stays true to


-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 512-248-2683                 E-Mail: ler@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893


From: Randy Bush
Date: Monday, May 25, 2009 - 6:24 pm

the problem is worst on a zfs system, crashing.  and it is upgrading now
and does so once a week.

the problem also manifests on non-zfs systems.

randy
From: Kip Macy
Date: Tuesday, May 26, 2009 - 9:13 pm

Which arch?

How much memory?

What are your loader.conf settings?




-Kip



-- 
When bad men combine, the good must associate; else they will fall one
by one, an unpitied sacrifice in a contemptible struggle.

    Edmund Burke
From: Randy Bush
Date: Tuesday, May 26, 2009 - 9:46 pm

> Which arch?



on the worst one, which runs zfs

# grep -v ^# /boot/loader.conf*
/boot/loader.conf.local:loader_logo=beastie
/boot/loader.conf.local:console="comconsole vidconsole"
/boot/loader.conf.local:comconsole_speed=9600
/boot/loader.conf.local:vfs.zfs.prefetch_disable=1
/boot/loader.conf.local:zfs_load=YES
/boot/loader.conf.local:vfs.zfs.prefetch_disable=1
/boot/loader.conf.local:geom_mirror_load=YES
/boot/loader.conf.local:kern.maxvnodes=50000

on another which has gmirror, not zfs

# grep -v ^# /boot/loader.conf*
/boot/loader.conf.local:loader_logo=beastie
/boot/loader.conf.local:console="comconsole vidconsole"
/boot/loader.conf.local:comconsole_speed="9600"
/boot/loader.conf.local:vm.pmap.pg_ps_enabled=1
/boot/loader.conf.local:geom_mirror_load=YES

...

randy
From: Kip Macy
Date: Tuesday, May 26, 2009 - 10:03 pm

-- 
When bad men combine, the good must associate; else they will fall one
by one, an unpitied sacrifice in a contemptible struggle.

    Edmund Burke
From: Randy Bush
Date: Tuesday, May 26, 2009 - 10:05 pm

yep.  i am presuming that it is some kernel or other config aspect.

randy
From: Kip Macy
Date: Thursday, May 28, 2009 - 12:53 am

What type of hard drives?

How big are your zpools?

Do you use compression?


(I'm wondering if compression and slow disks have something to do with it)


-- 
When bad men combine, the good must associate; else they will fall one
by one, an unpitied sacrifice in a contemptible struggle.

    Edmund Burke
From: Randy Bush
Date: Thursday, May 28, 2009 - 2:43 am

a zfs system

ad4: 305245MB <Seagate ST3320620AS 3.AAK> at ata2-master SATA150
ad6: 305245MB <Seagate ST3320620AS 3.AAE> at ata3-master SATA150
ad8: 305245MB <Seagate ST3320620AS 3.AAE> at ata4-master SATA150
ad10: 305245MB <Seagate ST3320620AS 3.AAK> at ata5-master SATA150

a gmirror system

ad4: 238475MB <Seagate ST3250820NS 3.AEK> at ata2-master SATA150
ad5: 238475MB <Seagate ST3250820NS 3.AEK> at ata2-slave SATA150
ad6: 238475MB <Seagate ST3250820NS 3.AEK> at ata3-master SATA150

again, this is happening on non-zfs systems as well.  i do not think
this is zfs related.  but the zfs system is the one with the worst
lockups.  it looks like

Filesystem        1024-blocks      Used     Avail Capacity  Mounted on
/dev/mirror/boota     8122126    636960   6835396     9%    /
devfs                       1         1         0   100%    /dev
procfs                      4         4         0   100%    /proc
tank/data           653313024         0 653313024     0%    /data
tank/data/nfsen     845243776 191930752 653313024    23%    /data/nfsen
tank/data/rpki      653494144    181120 653313024     0%    /data/rpki
tank                653313024         0 653313024     0%    /tank
tank/usr            658919040   5606016 653313024     1%    /usr
tank/usr/home       660368256   7055232 653313024     1%    /usr/home
tank/usr/usr        658758144   5445120 653313024     1%    /usr/usr
tank/var            654433024   1120000 653313024     0%    /var
tank/var/log        653400960     87936 653313024     0%    /var/log
tank/var/spool      653337088     24064 653313024     0%    /var/spool
/dev/md0               253678        14    233370     0%    /tmp

nope

randy
From: Thomas Backman
Date: Wednesday, May 27, 2009 - 12:20 am

I ran into this crash (I think*) yesterday too, albeit in an (amd64) VM
with 768MB RAM. However, I had set arc_min="30M" and arc_max="100M", so
I expected it to work, but it crashed within 10-15 minutes of
make -j4 buildworld. I changed the values to 5 and 30M, and so far
(~30 minutes) no crash. The sources were from late May 21st, currently
building rev. 192805 (since 192808 broke the build, at least on the
tinderbox).

* "I think" because I went to check on it in the middle of the night,
saw a page fault in kernel mode or whatever, and figured "damnit...
well, I'll suspend the VM, turn the laptop off and check in the
morning". I hit shutdown instead, so no backtrace or anything. D'oh!

Regards,
Thomas
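
For reference, a minimal sketch of the ARC limits described in the message above. It assumes the shorthand arc_min/arc_max refers to the vfs.zfs.arc_min and vfs.zfs.arc_max loader tunables (the message does not spell out the full names). to_bytes is a hypothetical helper, not a FreeBSD tool; it only converts a loader-style size into bytes so the value can be compared with what sysctl reports after a reboot.

```sh
#!/bin/sh
# /boot/loader.conf lines presumably meant by the message above
# (ASSUMPTION: the full tunable names are not given in the message):
#   vfs.zfs.arc_min="30M"
#   vfs.zfs.arc_max="100M"

# to_bytes (hypothetical helper): convert a size with an optional
# K, M or G suffix into bytes. A plain number is returned unchanged.
to_bytes() {
    case "$1" in
        *[Kk]) echo $(( ${1%?} * 1024 )) ;;
        *[Mm]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *[Gg]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}
```

Usage, on a system where the tunable is set: compare `to_bytes 100M` with the output of `sysctl -n vfs.zfs.arc_max` to confirm the limit was actually picked up by the loader.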
From: Kip Macy
Date: Wednesday, May 27, 2009 - 12:31 am

Can you try not setting the ARC?
I haven't had any problems on my comparably sized VMs.

-Kip
From: Thomas Backman
Date: Wednesday, May 27, 2009 - 12:43 am

Uh oh, I think I replied to the wrong thread. After reading the PR in  
question, this doesn't appear to be the same problem that I'm having  
(which appears to be the ARC growing until it panics). Anyway, when  
the build is complete and all that (~2.5 hours to go, plus other stuff  
after that), I'll try again with no ARC settings, when I have the time.

Regards,
Thomas
From: Thomas Backman
Date: Wednesday, May 27, 2009 - 2:50 am

OK, I tried it, since it crashed even with my low ARC settings. With  
*no* ARC settings, I get this:

cc -O2 -pipe -I. -DIN_GCC -DHAVE_CONFIG_H -DPREFIX=\"/usr\" -I/usr/obj/usr/src/gnu/usr.bin/cc/cc_tools/../cc_tools -I/usr/src/gnu/usr.bin/cc/cc_tools/../cc_tools -I/usr/src/gnu/usr.bin/cc/cc_tools/../../../../contrib/gcc -I/usr/src/gnu/usr.bin/cc/cc_tools/../../../../contrib/gcc/config -I/usr/src/gnu/usr.bin/cc/cc_tools/../../../../contrib/gcclibs/include -I/usr/src/gnu/usr.bin/cc/cc_tools/../../../../contrib/gcclibs/libcpp/include -I/usr/src/gnu/usr.bin/cc/cc_tools/../../../../contrib/gcclibs/libdecnumber -g -DGENERATOR_FILE -DHAVE_CONFIG_H   -I/usr/obj/usr/src/tmp/legacy/usr/include -c /usr/src/gnu/usr.bin/cc/cc_tools/../../../../contrib/gcc/genattr.c
cc -O2 -pipe -I. -DIN_GCC -DHAVE_CONFIG_H -DPREFIX=\"/usr\" -I/usr/obj/usr/src/gnu/usr.bin/cc/cc_tools/../cc_tools -I/usr/src/gnu/usr.bin/cc/cc_tools/../cc_tools -I/usr/src/gnu/usr.bin/cc/cc_tools/../../../../contrib/gcc -I/usr/src/gnu/usr.bin/cc/cc_tools/../../../../contrib/gcc/config -I/usr/src/gnu/usr.bin/cc/cc_tools/../../../../contrib/gcclibs/include -I/usr/src/gnu/usr.bin/cc/cc_tools/../../../../contrib/gcclibs/libcpp/include -I/usr/src/gnu/usr.bin/cc/cc_tools/../../../../contrib/gcclibs/libdecnumber -g -DGENERATOR_FILE -DHAVE_CONFIG_H   -I/usr/obj/usr/src/tmp/legacy/usr/include -c /usr/src/gnu/usr.bin/cc/cc_tools/../../../../contrib/gcc/genautomata.c
*** drop to debugger here ***

---------------

# while :; do date; vmstat -m | grep -E 'Type|solaris'; sysctl kstat.zfs.misc.arcstats.size; sleep 10; done

[...]

Wed May 27 11:33:43 CEST 2009
          Type InUse MemUse HighUse Requests  Size(s)
       solaris 44183 109686K       -  9175781   16,32,64,128,256,512,1024,2048,4096
kstat.zfs.misc.arcstats.size: 159089184
Wed May 27 11:33:53 CEST 2009
          Type InUse MemUse HighUse Requests  Size(s)
       solaris 37633 108437K       -  9536555   ...
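
The samples above mix units: vmstat -m reports the solaris malloc type in kilobytes, while the arcstats sysctl is in bytes. A small sketch for putting both in MiB when reading back a saved log of that loop. zfs_mem_summary is a hypothetical helper name, and the field positions are assumed from the output shown above, so treat it as illustrative rather than a tested tool. For the first sample it works out to roughly 107 MiB for solaris and 151 MiB for the ARC.

```sh
#!/bin/sh
# zfs_mem_summary (hypothetical helper): read the monitoring loop's
# output on stdin and print the solaris malloc usage and ARC size
# in whole MiB (fractions are truncated).
zfs_mem_summary() {
    awk '
        # vmstat -m row: Type InUse MemUse ...; MemUse looks like 109686K
        $1 == "solaris" {
            sub(/K$/, "", $3)
            printf "solaris malloc: %d MiB\n", $3 / 1024
        }
        # sysctl row: name: bytes
        $1 == "kstat.zfs.misc.arcstats.size:" {
            printf "arc size: %d MiB\n", $2 / 1048576
        }
    '
}
```

Usage, assuming the loop output was saved to a file: `zfs_mem_summary < arc.log`.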
From: Thomas Backman
Date: Wednesday, May 27, 2009 - 5:39 am

(Sorry if the quoting got FUBAR.)
I tried this "once" more, with 1GB VM RAM, no ARC settings and a 4GB  
swap for dumps. (Apparently, when I had 640MB VM RAM, the dump created  
was ~1150 MB. I figured it couldn't exceed RAM size.)
Between the previous tests and this, I had loads of crashes (with  
640MB), even so bad that I couldn't boot because savecore would cause  
a panic. I increased VM RAM and set arc_max in the loader and it  
booted fine. Then, I tried (see below) with 1GB VM RAM and again no  
ARC settings.

The wired count grew and grew and grew, until it crashed in  
lzjb_decompress(), backtrace:
lzjb_decompress()
zio_decompress()
zio_done()
zio_execute()
zio_done()
zio_execute()
taskq_thread()
fork_exit()
fork_trampoline()

On "call doadump" I got "Fatal double fault", no dump, and a reboot.

Here's a LONG stretch of vmstat output while running buildworld -j4.
Note how the wired count keeps increasing and increasing until it
breaks (in part, I guess this is intended, but it seems to grow a tad
out of hand):

Regards,
Thomas

------ NO arc settings below! ------

[serenity@clone ~]$ while :; do date; echo; vmstat -s | grep -E 'pages (cached|active|wired down|free$)'; echo; sleep 20; done
Wed May 27 13:54:50 CEST 2009

     15716 pages cached
     41358 pages active
     63244 pages wired down
    141368 pages free

Wed May 27 13:55:10 CEST 2009

     15719 pages cached
     65571 pages active
     65515 pages wired down
    113358 pages free

Wed May 27 13:55:30 CEST 2009

     15719 pages cached
     41471 pages active
     70436 pages wired down
    132093 pages free

Wed May 27 13:55:50 CEST 2009

     15809 pages cached
     11660 pages active
     81518 pages wired down
    153294 pages free

Wed May 27 13:56:10 CEST 2009

     16124 pages cached
     11013 pages active
     87862 pages wired down
    147672 pages free

Wed May 27 13:56:30 CEST 2009

     16124 pages cached
     10917 pages active
     ...
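
To put the page counts above in more familiar units, here is a rough sketch that converts the "pages wired down" lines to MiB. It assumes a 4096-byte page size (the usual value on amd64; `sysctl -n hw.pagesize` reports the real one), and wired_mib is a hypothetical helper name. Under that assumption the wired count goes from about 247 MiB at 13:54:50 to about 343 MiB at 13:56:10, on a VM with 1GB of RAM.

```sh
#!/bin/sh
# wired_mib (hypothetical helper): read vmstat -s style lines on stdin
# and print the wired page count of each sample in whole MiB.
# $1 is the page size in bytes; 4096 is an ASSUMED default.
wired_mib() {
    pagesize=${1:-4096}
    awk -v ps="$pagesize" '
        /pages wired down/ { printf "%d MiB wired\n", $1 * ps / 1048576 }
    '
}
```

Usage, assuming the loop output was saved to a file: `wired_mib "$(sysctl -n hw.pagesize)" < vmstat.log`.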