Hi again,

It seems I am having very bad luck with 2.6.27. As Linus said, it has to be
released soon, but it is crashing like hell under high network load.

Even after I changed FIB to HASH, it still gets stuck after 3-4 hours of
running, and NOTHING helps: software watchdog (the hardware iTCO_wdt is
probably not working at all on ICH9 Intel boards), nmi_watchdog (even with
panic on oops set, and reboot on panic set too), hangcheck-timer, a shell
script with a ping watchdog. Nothing helps. So people with heavily loaded
machines without IPMI and management cards will probably be very "happy".

I now have a watchdog that just does a malloc/free loop; I will try to modify
it so that it sends \000 only when a ping has succeeded.

In one crash I had netconsole enabled; in the next crash it was disabled.
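A minimal sketch of the ping-gated watchdog feeder described above, written in Python rather than in whatever language the original daemon uses. It assumes the standard Linux /dev/watchdog character device, where any write counts as a keepalive, and iputils-style ping flags (-c, -W). The function names, the 5-second interval and the probed host are illustrative and not taken from the thread.

```python
import subprocess
import time


def host_reachable(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo request to `host` is answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def feed_once(write, check) -> bool:
    """Send one keepalive byte through `write` only if `check()` succeeds."""
    if check():
        write(b"\0")
        return True
    return False


def run(host: str, device_path: str = "/dev/watchdog",
        interval_s: float = 5.0) -> None:
    """Feed the watchdog for as long as `host` keeps answering pings."""
    # buffering=0 so every keepalive byte reaches the driver immediately.
    with open(device_path, "wb", buffering=0) as device:
        while True:
            feed_once(device.write, lambda: host_reachable(host))
            time.sleep(interval_s)
```

Calling `run("192.0.2.1")` (a placeholder address) as root would keep feeding the device while the host answers. If forwarding stops, the keepalives stop too, and the watchdog timer is left to expire on its own.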
Here is a message I got over syslog on the last crash (it was 2.6.25-rc6-git6),
also available at http://www.nuclearcat.com/files/crash_2.6.25.txt

Mar 26 02:27:14 ROUTER [ 4698.694693] BUG: NMI Watchdog detected LOCKUP
Mar 26 02:27:14 ROUTER on CPU1, ip c02ad109, registers:
Mar 26 02:27:14 ROUTER [ 4698.694693] Process snmpd (pid: 2327, ti=c092e000
task=f7459080 task.ti=f70b7000)
Mar 26 02:27:14 ROUTER
Mar 26 02:27:14 ROUTER [ 4698.694693] Stack:
Mar 26 02:27:14 ROUTER c092eb14
Mar 26 02:27:14 ROUTER c011991e
Mar 26 02:27:14 ROUTER f750d600
Mar 26 02:27:14 ROUTER f750d600
Mar 26 02:27:14 ROUTER c0378058
Mar 26 02:27:14 ROUTER 00000001
Mar 26 02:27:14 ROUTER c092eb34
Mar 26 02:27:14 ROUTER c0119b3b
Mar 26 02:27:14 ROUTER
Mar 26 02:27:14 ROUTER [ 4698.694693]
Mar 26 02:27:14 ROUTER 00000000
Mar 26 02:27:14 ROUTER 00000001
Mar 26 02:27:14 ROUTER 00000082
Mar 26 02:27:14 ROUTER f708af88
Mar 26 02:27:14 ROUTER c0378058
Mar 26 02:27:14 ROUTER 00000001
Mar 26 02:27:14 ROUTER c092eb3c
Mar 26 02:27:14 ROUTER c0119bfe
Mar 26 02:27:14 ROUTER
Mar 26 02:27:14 ROUTER [ 4698.694693]
Mar 26 02:27:14 ROUTER c092eb50
Mar 26 02:27:14 ROUTER c012f19c
Mar 26 02:27:14 ROUTER 00000000
Mar 26 02:27:14 ROUTER f7...
From: "Denys Fedoryshchenko" <denys@visp.net.lb>
That's amazing, you've taken a trip into the future and are running
2.6.27 already, please let me borrow your time machine :-)

More seriously, there is obviously something very unique to your
setup or else everyone would be reporting this crash, and we have
to find out what that might be.

There seems to be a bunch of netfilter stuff in your traces, but
the top of the trace is somewhere totally unrelated. This is
a common recurrence in your crash traces, making them less
useful than they could be.

I know you asked before what can be done to improve the traces,
but I'm not an x86 expert so I have no idea how to help you
in that area.

Patrick, could you see if you can make any sense of his log?
I see conntrack a lot in the backtraces.
--
The conntrack stuff looks harmless, I went through the code just
to make sure, but I can't see anything wrong there.
--
Sorry, I mixed it up, it is 2.6.25-rc7 (rc7 seems to have migrated to 2X :-)),
sleepless night.

I can provide an image of the system (running on 128MB USB flash). It is a
Core 2 Duo CPU, 3xe100, 1xe1000e, NAT, shaping over ifb with htb, and of
course some iptables filtering rules. But the problem is simulating the load.
The biggest problem is why it doesn't reboot with all this watchdog stuff
enabled; that causes a serious headache. We are now installing a power
switch, so that will help somewhat.
--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.
--
I can see rt_garbage_collect() involved here. This one might explain very long
delays in softirq processing, and eventually crashes...

Denys, could you post:
grep . /proc/sys/net/ipv4/route/*
rtstat -c1 -i10
--
Yes, it seems related to routing. Such a thing was not happening before (maybe
because TRIE was operating better?).

Here is the info at "peak time". I have disabled nmi_watchdog for now, so the
garbage collector will not be triggered by the NMI watchdog.

rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
 entries|  in_hit|in_slow_|in_slow_|in_no_ro|  in_brd|in_marti|in_marti| out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
        |        |     tot|      mc|     ute|        |  an_dst|  an_src|        |    _tot|     _mc|        |      ed|    miss| verflow| _search|t_search|
  247011| 2255410|  162866|       0|     610|       5|       0|       0|   33761|    8574|       0|  165475|  165307|       0|       0|10407428|  174948|
  251810|   43087|    2782|       0|       9|       0|       0|       0|     618|     122|       0|    2912|    2909|       0|       0|  241946|    3796|
  256277|   43035|    2739|       0|       9|       0|       0|       0|     595|     121|       0|    2867|    2864|       0|       0|  243724|    3748|
  260177|   43596|    2647|       0|       8|       0|       0|       0|     672|     123|       0|    2778|    2776|       0|       0|  246880|    4048|
  232741|   42270|    2759|       0|      15|       0|       0|       0|     665|     135|       0|    2910|    2907|       0|       0|  233990|    3938|
  226623|   42615|    2792|       0|      11|       0|       0|       0|     723|     132|       0|    2935|    2932|       0|       0|  218378|    3862|
  233190|   42397|    2778|       0|       8|       0|       0|       0|     675|     128|       0|    2913|    2909|       0|       0|  214258|    3703|
  239093|   42342|    2713|       0|       9|       0|       0|       0|     764|     126|       0|    2847|    2845|       0|       0|  216453|    4080|
  150539|   36992|    7564|       0|      58|       0|       0|       0| ...
You want to tune the route cache for your special needs, and not permit it
to store 5 million entries!

# default is a gc every 60 seconds, not good for large caches
echo 1 >/proc/sys/net/ipv4/route/gc_interval
# default is 8 entries per slot..
echo 4 >/proc/sys/net/ipv4/route/gc_elasticity
# avoid a flush every 10 minutes
echo 3600 >/proc/sys/net/ipv4/route/secret_interval
--
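A rough sketch of why gc_elasticity matters for a cache of this size. It rests on two assumptions about 2.6-era kernels that are not stated in the thread: the route hash table has as many buckets as gc_thresh (32768 in the sysctl output posted elsewhere in the thread), and garbage collection only turns aggressive once the number of entries exceeds gc_elasticity times the bucket count. The function names are illustrative.

```python
def average_chain_length(entries: int, buckets: int) -> float:
    """Mean number of cached routes per hash bucket."""
    return entries / buckets


def aggressive_gc_threshold(gc_elasticity: int, buckets: int) -> int:
    """Entry count above which GC is assumed to start shrinking the cache hard."""
    return gc_elasticity * buckets


BUCKETS = 32768    # assumption: equal to gc_thresh on this machine
ENTRIES = 247011   # "entries" column of rtstat at peak time

print(f"average chain length: {average_chain_length(ENTRIES, BUCKETS):.2f}")
for elasticity in (8, 4):
    threshold = aggressive_gc_threshold(elasticity, BUCKETS)
    print(f"gc_elasticity={elasticity}: aggressive GC above {threshold} entries")
```

Under these assumptions the default elasticity of 8 puts the threshold at 262144 entries, close to the roughly 250000 entries reported at peak time. Lowering it to 4 would put the threshold at 131072, which should keep hash chains about half as long.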
After some time
Kup /config # grep . /proc/sys/net/ipv4/route/*
/proc/sys/net/ipv4/route/error_burst:5000
/proc/sys/net/ipv4/route/error_cost:1000
grep: /proc/sys/net/ipv4/route/flush: Permission denied
/proc/sys/net/ipv4/route/gc_elasticity:8
/proc/sys/net/ipv4/route/gc_interval:60
/proc/sys/net/ipv4/route/gc_min_interval:0
/proc/sys/net/ipv4/route/gc_min_interval_ms:500
/proc/sys/net/ipv4/route/gc_thresh:32768
/proc/sys/net/ipv4/route/gc_timeout:300
/proc/sys/net/ipv4/route/max_size:524288
/proc/sys/net/ipv4/route/min_adv_mss:256
/proc/sys/net/ipv4/route/min_pmtu:552
/proc/sys/net/ipv4/route/mtu_expires:600
/proc/sys/net/ipv4/route/redirect_load:20
/proc/sys/net/ipv4/route/redirect_number:9
/proc/sys/net/ipv4/route/redirect_silence:20480
/proc/sys/net/ipv4/route/secret_interval:600
Kup /config #
Kup /config # rtstat -c1 -i10
rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
 entries|  in_hit|in_slow_|in_slow_|in_no_ro|  in_brd|in_marti|in_marti| out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
        |        |     tot|      mc|     ute|        |  an_dst|  an_src|        |    _tot|     _mc|        |      ed|    miss| verflow| _search|t_search|
  337510| 8928889| 1855387|       0|   14286|      69|       0|       0|  110840|   58108|       0| 1922232| 1920809|     377|       0|20908744|  294715|
Kup /config #
--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.
--
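The cumulative counters in the rtstat output above are easier to read as ratios. The sketch below parses one rtstat row (the single row printed by `rtstat -c1 -i10` in the /config session) and derives three figures. The column names are the unabbreviated forms of the wrapped headers. The reading of gc_ignored as rate-limited collector calls and of in_hlist_search as chain entries stepped over during input lookups is an assumption about the kernel counters, not something stated in the thread.

```python
COLUMNS = [
    "entries", "in_hit", "in_slow_tot", "in_slow_mc", "in_no_route",
    "in_brd", "in_martian_dst", "in_martian_src", "out_hit",
    "out_slow_tot", "out_slow_mc", "gc_total", "gc_ignored",
    "gc_goal_miss", "gc_dst_overflow", "in_hlist_search",
    "out_hlist_search",
]


def parse_row(text: str) -> dict:
    """Turn one '|'-separated rtstat data row into a name -> value mapping."""
    values = [int(cell) for cell in text.split("|") if cell.strip()]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    return dict(zip(COLUMNS, values))


def summarize(row: dict) -> dict:
    """Derive ratios from the cumulative counters of one row."""
    lookups = row["in_hit"] + row["in_slow_tot"]
    return {
        "input_hit_ratio": row["in_hit"] / lookups,
        "gc_ignored_ratio": row["gc_ignored"] / row["gc_total"],
        "chain_steps_per_input_lookup": row["in_hlist_search"] / lookups,
    }


ROW = (
    "337510| 8928889| 1855387| 0| 14286| 69| 0| 0| "
    "110840| 58108| 0| 1922232| 1920809| 377| 0|20908744| 294715|"
)

stats = summarize(parse_row(ROW))
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

For this row, roughly 83% of input lookups hit the cache and more than 99.9% of collector invocations are counted as ignored. If the assumed meaning of gc_ignored holds, the collector is being called far more often than it is allowed to run.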
Kup ~ # grep . /proc/sys/net/ipv4/route/*
/proc/sys/net/ipv4/route/error_burst:5000
/proc/sys/net/ipv4/route/error_cost:1000
grep: /proc/sys/net/ipv4/route/flush: Permission denied
/proc/sys/net/ipv4/route/gc_elasticity:8
/proc/sys/net/ipv4/route/gc_interval:60
/proc/sys/net/ipv4/route/gc_min_interval:0
/proc/sys/net/ipv4/route/gc_min_interval_ms:500
/proc/sys/net/ipv4/route/gc_thresh:32768
/proc/sys/net/ipv4/route/gc_timeout:300
/proc/sys/net/ipv4/route/max_size:524288
/proc/sys/net/ipv4/route/min_adv_mss:256
/proc/sys/net/ipv4/route/min_pmtu:552
/proc/sys/net/ipv4/route/mtu_expires:600
/proc/sys/net/ipv4/route/redirect_load:20
/proc/sys/net/ipv4/route/redirect_number:9
/proc/sys/net/ipv4/route/redirect_silence:20480
/proc/sys/net/ipv4/route/secret_interval:600
Kup ~ # rtstat -c1 -i10
rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
 entries|  in_hit|in_slow_|in_slow_|in_no_ro|  in_brd|in_marti|in_marti| out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
        |        |     tot|      mc|     ute|        |  an_dst|  an_src|        |    _tot|     _mc|        |      ed|    miss| verflow| _search|t_search|
  247734| 6225923|  482238|       0|    1472|      32|       0|       0|   75983|   30203|       0|  508575|  507658|       4|       0|18714805|  250265|
Kup ~ # rtstat -c1 -i10
rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
 entries|  in_hit|in_slow_|in_slow_|in_no_ro|  in_brd|in_marti|in_marti| out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
        |        |     tot|      mc|     ute|        |  an_dst|  an_src|        |    _tot|     _mc|        |      ed|    miss| verflow| _search|t_search|
  246851| 6356930|  496921|       0|    1535|      33|       0|       0|   77469|   30908|...
