doing a "ping -f -l 3" on my host towards my board on linus tree as of Friday results in lots of: nf_conntrack: table full, dropping packet. nf_conntrack: table full, dropping packet. nf_conntrack: table full, dropping packet. __ratelimit: 11 callbacks suppressed nf_conntrack: table full, dropping packet. nf_conntrack: table full, dropping packet. nf_conntrack: table full, dropping packet. nf_conntrack: table full, dropping packet. for ucc_geth on a MPC832x. This really looks strange to me, ideas? Jocke --
What does /proc/net/netfilter/nf_conntrack show? --
There is no /proc/net/netfilter/nf_conntrack. There is a /proc/net/nf_conntrack though and it is empty. If I telnet to the board I see: ipv4 2 tcp 6 431990 ESTABLISHED src=192.168.1.15 dst=192.168.1.17 sport=56445 dport=23 src=192.168.1.17 dst=192.168.1.15 sport=23 dport=56445 [ASSURED] use=1 ipv4 2 udp 17 7 src=192.168.1.228 dst=192.168.1.255 sport=138 dport=138 [UNREPLIED] src=192.168.1.255 dst=192.168.1.228 sport=138 dport=138 use=1 ipv4 2 udp 17 20 src=127.0.0.1 dst=127.0.0.1 sport=34261 dport=53 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=53 dport=34261 use=1 ipv4 2 udp 17 2 src=192.168.1.199 dst=192.168.1.255 sport=138 dport=138 [UNREPLIED] src=192.168.1.255 dst=192.168.1.199 sport=138 dport=138 use=1 ipv4 2 udp 17 20 src=127.0.0.1 dst=127.0.0.1 sport=40417 dport=53 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=53 dport=40417 use=1 Jocke --
That means that something is leaking conntrack references, most likely by leaking skbs. Since I haven't seen any other reports, my guess would be the ucc_geth driver. --
hmm, I cannot see what in the ucc_geth driver is possibly "leaking". One thing I do notice is that the board becomes almost unresponsive during the ping flood. Perhaps it is building up a backlog of conntracks during the ping flood? Jocke --
Mucking around with the ucc_geth driver I found that if I: - Move TX from IRQ to NAPI context - double the weight. - after booting up, wait a few mins until the JFFS2 GC kernel thread has stopped scanning the FS Then the "nf_conntrack: table full, dropping packet." msgs stops. Does this seem right to you guys? Jocke --
How many cpus do you have ? What kernel version do you use ? --
I suspect RCU problem. Maybe the GC kernel threads blocks synchronize_rcu() ? --
No. As I said, something seems to be leaking packets. You should be able to confirm that by checking the sk_buff slabs in /proc/slabinfo. If that *doesn't* show any signs of a leak, please run "conntrack -E" to capture the conntrack events before the "table full" message appears and post the output. --
skbuff does not differ much, but others do Before ping: During ping: This feels more like the freeing of conntrack objects are delayed and builds up when ping flooding. Don't have "conntrack -E" for my embedded board so that will have to wait a bit longer. Jocke --
I dont understand how your ping can use so many conntrack entries... Then, as I said yesterday, I believe you have a RCU delay, because of a misbehaving driver or something... grep RCU .config grep CONFIG_SMP .config You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c (line 80) as a workaround. It should force a quiescent state after 1000 freed conntracks. --
grep RCU .config # RCU Subsystem CONFIG_CLASSIC_RCU=y # CONFIG_TREE_RCU is not set # CONFIG_PREEMPT_RCU is not set # CONFIG_TREE_RCU_TRACE is not set # CONFIG_PREEMPT_RCU_TRACE is not set # CONFIG_RCU_TORTURE_TEST is not set grep CONFIG_SMP .config conntracks. right, doing this almost killed all conntrack messages, had to stress it pretty hard before I saw handful "nf_conntrack: table full, dropping packet" RCU is not my cup of tea, do you have any ideas were to look? Jocke --
In a stress situation, you feed more deleted conntracks to call_rcu() than
the blimit (10 real freeing per RCU softirq invocation).
So with default qhimark being 10000, this means about 10000 conntracks
can sit in RCU (per CPU) before being really freed.
Only when hitting 10000, RCU enters a special mode to free all queued items, instead
of a small batch of 10
To solve your problem we can :
1) reduce qhimark from 10000 to 1000 (for example)
Probably should be done to reduce some spikes in RCU code when freeing
whole 10000 elements...
OR
2) change conntrack tunable (max conntrack entries on your machine)
OR
3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count
in nf_conntrack_free() instead of callback.
[PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
We use RCU to defer freeing of conntrack structures. In DOS situation, RCU might
accumulate about 10.000 elements per CPU in its internal queues. To get accurate
conntrack counts (at the expense of slightly more RAM used), we might consider
conntrack counter not taking into account "about to be freed elements, waiting
in RCU queues". We thus decrement it in nf_conntrack_free(), not in the RCU
callback.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index f4935e3..6478dc7 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -516,16 +516,17 @@ EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
static void nf_conntrack_free_rcu(struct rcu_head *head)
{
struct nf_conn *ct = container_of(head, struct nf_conn, rcu);
- struct net *net = nf_ct_net(ct);
nf_ct_ext_free(ct);
kmem_cache_free(nf_conntrack_cachep, ct);
- atomic_dec(&net->ct.count);
}
void nf_conntrack_free(struct nf_conn *ct)
{
+ struct net *net = nf_ct_net(ct);
+
nf_ct_ext_destroy(ct);
+ atomic_dec(&net->ct.count);
call_rcu(&ct->rcu, nf_conntrack_free_rcu);
}
...I forgot to say this is what we do for 'struct file' freeing as well. We
decrement nr_files in file_free(), not in file_free_rcu()
static inline void file_free_rcu(struct rcu_head *head)
{
struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
put_cred(f->f_cred);
kmem_cache_free(filp_cachep, f);
}
static inline void file_free(struct file *f)
{
percpu_counter_dec(&nr_files); <<<< HERE >>>>
file_check_state(f);
call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
}
--
While temporarily exceeding the limit by up to 10000 entries is quite a lot, I guess the important thing is that it can't grow unbounded, so I think this patch is fine. --
Maybe we could use SLAB_DESTROY_BY_RCU thing and no more call_rcu() queueing problem. That would better use CPU caches as well... --
I'm not sure I understand the rules correctly, but we'd still have to wait for the grace period before an object can be reused, no? --
No we dont have to, but we must do additionnal checks after getting
a reference on object found on lookup.
(We must re-check the keys used during search)
This re-check is not very expensive since everything is hot in cpu cache.
Check Documentation/RCU/rculist_nulls.txt for some documentation.
1) Lookup algo
--------------
rcu_read_lock()
begin:
obj = lockless_lookup(key);
if (obj) {
if (!try_get_ref(obj)) // might fail for free objects
goto begin;
/*
* Because a writer could delete object, and a writer could
* reuse these object before the RCU grace period, we
* must check key after geting the reference on object
*/
if (obj->key != key) { // not the object we expected
put_ref(obj);
goto begin;
}
}
rcu_read_unlock();
--
The patch fixes the problem and the system feels a bit more responsive too, thanks. I guess I should probably do both 1) and 3) as my board is pretty slow too. Been trying to figure out a good value for NAPI weigth too. Currently my HW RX and TX queues are 16 pkgs deep and weigth is 16 too. If I move TX processing to NAPI context AND increase weigth to 32, the system is a lot more responsive during ping flooding. Does weigth 32 make sense when the HW TX and RX queues are 16? Jocke --
If you only have one NIC, I dont understand why changing weight should make a difference. Are you referring to dev_weight or netdev_budget ? # cat /proc/sys/net/core/dev_weight 64 # cat /proc/sys/net/core/netdev_budget 300 --
Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 14:29:29: I mean this call in ucc_geth: netif_napi_add(dev, &ugeth->napi, ucc_geth_poll, UCC_GETH_DEV_WEIGHT); UCC_GETH_DEV_WEIGHT is 16 Noticed that rcuclassic.c has a module_param(qhimark, int, 0); But I can't figure out hot to set this qhimark from the cmdline. rcuclassic.c is not a module(I don't use modules at all) Jocke Jocke --
On Tue, 2009-03-24 at 13:07 +0100, Eric Dumazet wrote: Your patch fixes the problem on my board too (embedded mips router 250Mhz), thanks. Yet I'm concerned about what you said concerning RAM usage. I have a very small amount on memory left on my board (less than 4M), and I tuned ip route cache size and nf_conntrack_max to make sure I won't go OOM. With your patch, does it mean 10000 conntrack entries can be allocated while nf_conntrack_max is say only 2048 ? Regards, -- Maxime --
Temporarily under worst-case circumstances, yes. Eric is already working on his proposed improvement though :) --
Well... yes, RCU can have this 'interesting' OOM property. For small machines, you really want to lower RCU parameters, because as you said, we also push route cache entries in RCU queue, my patch being applied or not (But using call_rcu_bh(), so we have lower latencies I think) We are working on a SLAB_DESTROY_BY_RCU implementation so that conntrack wont use call_rcu() anymore, give us a couple of days :) Paul, could we have /sys knobs to be able to tune qhimark, blimit & qlowmark ? Thanks --
While working on this stuff, I found one suspect use of hlist_add_head() Its not a hot path, I believe following patch would make sure nothing wrong happens. If a chain contains element A and B, then we might build a new table with a new chain containing B and A (in this reverse order), and a cpu could see A->next = B (new pointer), B->next = A (old pointer) Thanks [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() Using hlist_add_head() in nf_conntrack_set_hashsize() is quite dangerous. Without any barrier, one CPU could see a loop while doing its lookup. Its true new table cannot be seen by another cpu, but previous table is still readable. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 55befe5..54e983f 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1121,7 +1121,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) struct nf_conntrack_tuple_hash, hnode); hlist_del_rcu(&h->hnode); bucket = __hash_conntrack(&h->tuple, hashsize, rnd); - hlist_add_head(&h->hnode, &hash[bucket]); + hlist_add_head_rcu(&h->hnode, &hash[bucket]); } } old_size = nf_conntrack_htable_size; --
Hi Patrick Here is the patch I had the time to test this time... No problem so far on my machine. I did a UDP flood stress. Thank you [PATCH] conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu() Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP. This permits an easy conversion from call_rcu() based hash lists to a SLAB_DESTROY_BY_RCU one. Avoiding call_rcu() delay at nf_conn freeing time has numerous gains. First, it doesnt fill RCU queues (up to 10000 elements per cpu). This reduces OOM possibility, if queued elements are not taken into account This reduces latency problems when RCU queue size hits hilimit and triggers emergency mode. - It allows fast reuse of just freed elements, permitting better use of CPU cache. - We delete rcu_head from "struct nf_conn", shrinking size of this structure by 8 or 16 bytes. This patch only takes care of "struct nf_conn". call_rcu() is still used for less critical conntrack parts, that may be converted later if necessary. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/net/netfilter/nf_conntrack.h | 14 - include/net/netfilter/nf_conntrack_tuple.h | 6 include/net/netns/conntrack.h | 5 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c | 16 - net/ipv4/netfilter/nf_nat_core.c | 2 net/netfilter/nf_conntrack_core.c | 123 +++++----- net/netfilter/nf_conntrack_expect.c | 2 net/netfilter/nf_conntrack_helper.c | 7 net/netfilter/nf_conntrack_netlink.c | 10 net/netfilter/nf_conntrack_standalone.c | 16 - net/netfilter/xt_connlimit.c | 4 11 files changed, 114 insertions(+), 91 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 4dfb793..6c3f964 100644 --- ...
Don't we need to make sure the entry is not reused while dumping This should be "1" I think since it wants a hlist_nulls hash. --
OK I just realized my mistake, please ignore :) --
Ah yes, I forgot that for UDP/TCP I had to change locking on this part. Because messing with reference count was crazy... But in UDP/TCP we have different spinlock for each chain, so hold time was small enough. So I guess that with central conntrack lock, we need to take references on entries while dumping them. --
Here is take 2 of the patch with proper ref counting on dumping. Thank you [PATCH] conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu() Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP. This permits an easy conversion from call_rcu() based hash lists to a SLAB_DESTROY_BY_RCU one. Avoiding call_rcu() delay at nf_conn freeing time has numerous gains. First, it doesnt fill RCU queues (up to 10000 elements per cpu). This reduces OOM possibility, if queued elements are not taken into account This reduces latency problems when RCU queue size hits hilimit and triggers emergency mode. - It allows fast reuse of just freed elements, permitting better use of CPU cache. - We delete rcu_head from "struct nf_conn", shrinking size of this structure by 8 or 16 bytes. This patch only takes care of "struct nf_conn". call_rcu() is still used for less critical conntrack parts, that may be converted later if necessary. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/net/netfilter/nf_conntrack.h | 14 - include/net/netfilter/nf_conntrack_tuple.h | 6 include/net/netns/conntrack.h | 5 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c | 57 ++-- net/ipv4/netfilter/nf_nat_core.c | 2 net/netfilter/nf_conntrack_core.c | 123 +++++----- net/netfilter/nf_conntrack_expect.c | 2 net/netfilter/nf_conntrack_helper.c | 7 net/netfilter/nf_conntrack_netlink.c | 20 - net/netfilter/nf_conntrack_standalone.c | 51 ++-- net/netfilter/xt_connlimit.c | 6 11 files changed, 166 insertions(+), 127 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 4dfb793..6c3f964 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -91,8 ...
Can we assume the next pointer still points to the next entry in the same chain after the refcount dropped to zero? --
We are looking chain N. If we cannot atomic_inc() refcount, we got some deleted entry. If we could atomic_inc, we can meet an entry that just moved to another chain X When hitting its end, we continue the search to the N+1 chain so we only skip the end of previous chain (N). We can 'forget' some entries, we can print several time one given entry. We could solve this by : 1) Checking hash value : if not one expected -> Going back to head of chain N, (potentially re-printing already handled entries) So it is not a *perfect* solution. 2) Use a locking to forbid writers (as done in UDP/TCP), but it is expensive and wont solve other problem : We wont avoid emitting same entry several time anyway (this is a flaw of current seq_file handling, since we 'count' entries to be skiped, and this is wrong if some entries were deleted or inserted meanwhile) We have same problem on /proc/net/udp & /proc/net/tcp, I am not sure we should care... Also, current resizing code can give to a /proc/net/ip_conntrack reader a problem, since hash table can switch while its doing its dumping : many entries might be lost or regiven... --
I think double entries are not a problem, as you say, there are already other cases where this can happen. But I think we should try our best that every entry present at the start and still present at the end of a dump is also contained in the dump, otherwise the guantees seem to weak to still be useful. Thats true. But its a very rare operation, so I think its mainly a documentation issue. --
If your concern is to not forget entries, and we are allowed to print some entries several times,
then we can just check the final "nulls" value, and if we find a different value than expected for
chain N, go back to begining of chain N.
No need to check hash value (this could help not print several time same entry, we dont care that much)
+ while (is_a_nulls(head)) {
+ if (likely(get_nulls_value(head) == st->bucket)) {
+ if (++st->bucket >= nf_conntrack_htable_size)
+ return NULL;
+ }
Thank you
[PATCH] conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()
Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP.
This permits an easy conversion from call_rcu() based hash lists to a
SLAB_DESTROY_BY_RCU one.
Avoiding call_rcu() delay at nf_conn freeing time has numerous gains.
First, it doesnt fill RCU queues (up to 10000 elements per cpu).
This reduces OOM possibility, if queued elements are not taken into account
This reduces latency problems when RCU queue size hits hilimit and triggers
emergency mode.
- It allows fast reuse of just freed elements, permitting better use of
CPU cache.
- We delete rcu_head from "struct nf_conn", shrinking size of this structure
by 8 or 16 bytes.
This patch only takes care of "struct nf_conn".
call_rcu() is still used for less critical conntrack parts, that may
be converted later if necessary.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
include/net/netfilter/nf_conntrack.h | 14 -
include/net/netfilter/nf_conntrack_tuple.h | 6
include/net/netns/conntrack.h | 5
net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c | 63 ++---
net/ipv4/netfilter/nf_nat_core.c | 2
net/netfilter/nf_conntrack_core.c | 123 +++++-----
net/netfilter/nf_conntrack_expect.c | 2
...Just add "rcuclassic.qhimark=2048" to your cmdline. Jocke --
