Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs

Previous thread: [PATCH] ucc_geth: Convert to net_device_ops by Joakim Tjernlund on Monday, March 23, 2009 - 3:17 am. (11 messages)

Next thread: [PATCH] be2net: cleanup rx/tx rate calculations by Sathya Perla on Monday, March 23, 2009 - 4:51 am. (2 messages)
From: Joakim Tjernlund
Date: Monday, March 23, 2009 - 3:42 am

doing a "ping -f -l 3" on my host towards my board on linus tree as of 
Friday results in lots of:
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
__ratelimit: 11 callbacks suppressed
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.

for ucc_geth on a MPC832x.
This really looks strange to me, ideas?

 Jocke
--

From: Patrick McHardy
Date: Monday, March 23, 2009 - 5:15 am

What does /proc/net/netfilter/nf_conntrack show?
--

From: Joakim Tjernlund
Date: Monday, March 23, 2009 - 5:25 am

There is no /proc/net/netfilter/nf_conntrack. There is a
/proc/net/nf_conntrack though and it is empty. If I telnet
to the board I see:
ipv4     2 tcp      6 431990 ESTABLISHED src=192.168.1.15 dst=192.168.1.17 
sport=56445 dport=23 src=192.168.1.17 dst=192.168.1.15 sport=23 
dport=56445 [ASSURED] use=1
ipv4     2 udp      17 7 src=192.168.1.228 dst=192.168.1.255 sport=138 
dport=138 [UNREPLIED] src=192.168.1.255 dst=192.168.1.228 sport=138 
dport=138 use=1
ipv4     2 udp      17 20 src=127.0.0.1 dst=127.0.0.1 sport=34261 dport=53 
[UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=53 dport=34261 use=1
ipv4     2 udp      17 2 src=192.168.1.199 dst=192.168.1.255 sport=138 
dport=138 [UNREPLIED] src=192.168.1.255 dst=192.168.1.199 sport=138 
dport=138 use=1
ipv4     2 udp      17 20 src=127.0.0.1 dst=127.0.0.1 sport=40417 dport=53 
[UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=53 dport=40417 use=1

 Jocke

--

From: Patrick McHardy
Date: Monday, March 23, 2009 - 5:29 am

That means that something is leaking conntrack references, most likely
by leaking skbs. Since I haven't seen any other reports, my guess would
be the ucc_geth driver.
--

From: Joakim Tjernlund
Date: Monday, March 23, 2009 - 5:59 am

hmm, I cannot see what in the ucc_geth driver is possibly "leaking". One 
thing
I do notice is that the board becomes almost unresponsive during the ping 
flood.
Perhaps it is building up a backlog of conntracks during the ping flood?

 Jocke

--

From: Joakim Tjernlund
Date: Monday, March 23, 2009 - 10:42 am

Mucking around with the ucc_geth driver I found that if I:
 - Move TX from IRQ to NAPI context
 - double the weight.
 - after booting up, wait a few mins until the JFFS2 GC kernel thread has 
stopped
   scanning the FS 

Then the "nf_conntrack: table full, dropping packet." msgs stops.
Does this seem right to you guys?

 Jocke
--

From: Eric Dumazet
Date: Monday, March 23, 2009 - 10:49 am

How many cpus do you have ?

What kernel version do you use ?

--

From: Joakim Tjernlund
Date: Monday, March 23, 2009 - 11:04 am

Linus tree as of Friday

--

From: Eric Dumazet
Date: Monday, March 23, 2009 - 11:08 am

I suspect RCU problem. Maybe the GC kernel threads blocks synchronize_rcu() ?


--

From: Patrick McHardy
Date: Monday, March 23, 2009 - 10:49 am

No. As I said, something seems to be leaking packets. You should be
able to confirm that by checking the sk_buff slabs in /proc/slabinfo.
If that *doesn't* show any signs of a leak, please run "conntrack -E"
to capture the conntrack events before the "table full" message
appears and post the output.


--

From: Joakim Tjernlund
Date: Tuesday, March 24, 2009 - 1:22 am

skbuff does not differ much, but others do

Before ping:

During ping: 

This feels more like the freeing of conntrack objects are delayed and 
builds up when ping flooding.

Don't have "conntrack -E" for my embedded board so that will have to wait 
a bit longer.

 Jocke
--

From: Eric Dumazet
Date: Tuesday, March 24, 2009 - 2:12 am

I dont understand how your ping can use so many conntrack entries...

Then, as I said yesterday, I believe you have a RCU delay, because of
a misbehaving driver or something...

grep RCU .config
grep CONFIG_SMP .config

You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c (line 80)
as a workaround. It should force a quiescent state after 1000 freed conntracks.


--

From: Joakim Tjernlund
Date: Tuesday, March 24, 2009 - 3:55 am

grep RCU .config
# RCU Subsystem
CONFIG_CLASSIC_RCU=y
# CONFIG_TREE_RCU is not set
# CONFIG_PREEMPT_RCU is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
# CONFIG_RCU_TORTURE_TEST is not set
grep CONFIG_SMP .config
conntracks.

right, doing this almost killed all conntrack messages, had to stress it 
pretty
hard before I saw handful "nf_conntrack: table full, dropping packet"

RCU is not my cup of tea, do you have any ideas were to look?

 Jocke

--

From: Eric Dumazet
Date: Tuesday, March 24, 2009 - 5:07 am

In a stress situation, you feed more deleted conntracks to call_rcu() than
the blimit (10 real freeing per RCU softirq invocation). 

So with default qhimark being 10000, this means about 10000 conntracks
can sit in RCU (per CPU) before being really freed.

Only when hitting 10000, RCU enters a special mode to free all queued items, instead
of a small batch of 10

To solve your problem we can :

1) reduce qhimark from 10000 to 1000 (for example)
   Probably should be done to reduce some spikes in RCU code when freeing
   whole 10000 elements...
OR
2) change conntrack tunable (max conntrack entries on your machine)
OR
3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count
  in nf_conntrack_free() instead of callback.

[PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()

We use RCU to defer freeing of conntrack structures. In DOS situation, RCU might
accumulate about 10.000 elements per CPU in its internal queues. To get accurate
conntrack counts (at the expense of slightly more RAM used), we might consider
conntrack counter not taking into account "about to be freed elements, waiting
in RCU queues". We thus decrement it in nf_conntrack_free(), not in the RCU
callback.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index f4935e3..6478dc7 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -516,16 +516,17 @@ EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
 static void nf_conntrack_free_rcu(struct rcu_head *head)
 {
 	struct nf_conn *ct = container_of(head, struct nf_conn, rcu);
-	struct net *net = nf_ct_net(ct);
 
 	nf_ct_ext_free(ct);
 	kmem_cache_free(nf_conntrack_cachep, ct);
-	atomic_dec(&net->ct.count);
 }
 
 void nf_conntrack_free(struct nf_conn *ct)
 {
+	struct net *net = nf_ct_net(ct);
+
 	nf_ct_ext_destroy(ct);
+	atomic_dec(&net->ct.count);
 	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
 }
 ...
From: Eric Dumazet
Date: Tuesday, March 24, 2009 - 5:25 am

I forgot to say this is what we do for 'struct file' freeing as well. We
decrement nr_files in file_free(), not in file_free_rcu()

static inline void file_free_rcu(struct rcu_head *head)
{
        struct file *f = container_of(head, struct file, f_u.fu_rcuhead);

        put_cred(f->f_cred);
        kmem_cache_free(filp_cachep, f);
}

static inline void file_free(struct file *f)
{
        percpu_counter_dec(&nr_files);      <<<< HERE >>>>
        file_check_state(f);
        call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
}



--

From: Patrick McHardy
Date: Tuesday, March 24, 2009 - 5:43 am

While temporarily exceeding the limit by up to 10000 entries is
quite a lot, I guess the important thing is that it can't grow
unbounded, so I think this patch is fine.

--

From: Eric Dumazet
Date: Tuesday, March 24, 2009 - 6:32 am

Maybe we could use SLAB_DESTROY_BY_RCU thing and no more call_rcu() queueing
problem. That would better use CPU caches as well...

--

From: Patrick McHardy
Date: Tuesday, March 24, 2009 - 6:38 am

I'm not sure I understand the rules correctly, but we'd still
have to wait for the grace period before an object can be reused,
no?



--

From: Eric Dumazet
Date: Tuesday, March 24, 2009 - 6:47 am

No we dont have to, but we must do additionnal checks after getting
a reference on object found on lookup.
(We must re-check the keys used during search)

This re-check is not very expensive since everything is hot in cpu cache.

Check Documentation/RCU/rculist_nulls.txt for some documentation.

1) Lookup algo
--------------

rcu_read_lock()
begin:
obj = lockless_lookup(key);
if (obj) {
  if (!try_get_ref(obj)) // might fail for free objects
    goto begin;
  /*
   * Because a writer could delete object, and a writer could
   * reuse these object before the RCU grace period, we
   * must check key after geting the reference on object
   */
  if (obj->key != key) { // not the object we expected
     put_ref(obj);
     goto begin;
   }
}
rcu_read_unlock();


--

From: Joakim Tjernlund
Date: Tuesday, March 24, 2009 - 6:20 am

The patch fixes the problem and the system feels a bit more responsive 
too, thanks.
I guess I should probably do both 1) and 3) as my board is pretty slow 
too.

Been trying to figure out a good value for NAPI weigth too. Currently my
HW RX and TX queues are 16 pkgs deep and weigth is 16 too. If I move TX 
processing
to NAPI context AND increase weigth to 32, the system is a lot more 
responsive during
ping flooding. Does weigth 32 make sense when the HW TX and RX queues are 
16?

 Jocke

--

From: Patrick McHardy
Date: Tuesday, March 24, 2009 - 6:28 am

Applied, thanks everyone.
--

From: Eric Dumazet
Date: Tuesday, March 24, 2009 - 6:29 am

If you only have one NIC, I dont understand why changing weight should make
a difference. Are you referring to dev_weight or netdev_budget ?

# cat /proc/sys/net/core/dev_weight
64
# cat /proc/sys/net/core/netdev_budget
300


--

From: Joakim Tjernlund
Date: Tuesday, March 24, 2009 - 6:41 am

Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 14:29:29:


I mean this call in ucc_geth:
  netif_napi_add(dev, &ugeth->napi, ucc_geth_poll, UCC_GETH_DEV_WEIGHT);
UCC_GETH_DEV_WEIGHT is 16

Noticed that rcuclassic.c has a 
  module_param(qhimark, int, 0);
But I can't figure out hot to set this qhimark from the cmdline.
 rcuclassic.c is not a module(I don't use modules at all)

 Jocke
 Jocke


--

From: Maxime Bizon
Date: Tuesday, March 24, 2009 - 8:17 am

On Tue, 2009-03-24 at 13:07 +0100, Eric Dumazet wrote:


Your patch fixes the problem on my board too (embedded mips router
250Mhz), thanks.

Yet I'm concerned about what you said concerning RAM usage. I have a
very small amount on memory left on my board (less than 4M), and I tuned
ip route cache size and nf_conntrack_max to make sure I won't go OOM.

With your patch, does it mean 10000 conntrack entries can be allocated
while nf_conntrack_max is say only 2048 ?

Regards,

-- 
Maxime


--

From: Patrick McHardy
Date: Tuesday, March 24, 2009 - 8:21 am

Temporarily under worst-case circumstances, yes. Eric is already working
on his proposed improvement though :)
--

From: Eric Dumazet
Date: Tuesday, March 24, 2009 - 8:27 am

Well... yes, RCU can have this 'interesting' OOM property.

For small machines, you really want to lower RCU parameters, because
as you said, we also push route cache entries in RCU queue, my patch
being applied or not (But using call_rcu_bh(), so we have lower latencies
I think)

We are working on a SLAB_DESTROY_BY_RCU implementation so that
conntrack wont use call_rcu() anymore, give us a couple of days :)

Paul, could we have /sys knobs to be able to tune qhimark, blimit & qlowmark ?

Thanks

--

From: Eric Dumazet
Date: Tuesday, March 24, 2009 - 12:54 pm

While working on this stuff, I found one suspect use of hlist_add_head()

Its not a hot path, I believe following patch would make sure nothing
wrong happens.

If a chain contains element A and B, then we might build a new table
with a new chain containing B and A (in this reverse order), and
a cpu could see A->next = B (new pointer),  B->next = A (old pointer)

Thanks

[PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize()

Using hlist_add_head() in nf_conntrack_set_hashsize() is quite dangerous.
Without any barrier, one CPU could see a loop while doing its lookup.
Its true new table cannot be seen by another cpu, but previous table is still
readable.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 55befe5..54e983f 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1121,7 +1121,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 					struct nf_conntrack_tuple_hash, hnode);
 			hlist_del_rcu(&h->hnode);
 			bucket = __hash_conntrack(&h->tuple, hashsize, rnd);
-			hlist_add_head(&h->hnode, &hash[bucket]);
+			hlist_add_head_rcu(&h->hnode, &hash[bucket]);
 		}
 	}
 	old_size = nf_conntrack_htable_size;


--

From: Patrick McHardy
Date: Wednesday, March 25, 2009 - 9:26 am

Applied, thanks Eric.
--

From: Eric Dumazet
Date: Wednesday, March 25, 2009 - 10:53 am

Hi Patrick

Here is the patch I had the time to test this time...
No problem so far on my machine.
I did a UDP flood stress.

Thank you

[PATCH] conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()

Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP.

This permits an easy conversion from call_rcu() based hash lists to a
SLAB_DESTROY_BY_RCU one.

Avoiding call_rcu() delay at nf_conn freeing time has numerous gains.

First, it doesnt fill RCU queues (up to 10000 elements per cpu).
This reduces OOM possibility, if queued elements are not taken into account
This reduces latency problems when RCU queue size hits hilimit and triggers
emergency mode.

- It allows fast reuse of just freed elements, permitting better use of
CPU cache.

- We delete rcu_head from "struct nf_conn", shrinking size of this structure
by 8 or 16 bytes.

This patch only takes care of "struct nf_conn".
call_rcu() is still used for less critical conntrack parts, that may
be converted later if necessary.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/netfilter/nf_conntrack.h                  |   14 -
 include/net/netfilter/nf_conntrack_tuple.h            |    6
 include/net/netns/conntrack.h                         |    5
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c |   16 -
 net/ipv4/netfilter/nf_nat_core.c                      |    2
 net/netfilter/nf_conntrack_core.c                     |  123 +++++-----
 net/netfilter/nf_conntrack_expect.c                   |    2
 net/netfilter/nf_conntrack_helper.c                   |    7
 net/netfilter/nf_conntrack_netlink.c                  |   10
 net/netfilter/nf_conntrack_standalone.c               |   16 -
 net/netfilter/xt_connlimit.c                          |    4
 11 files changed, 114 insertions(+), 91 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 4dfb793..6c3f964 100644
--- ...
From: Patrick McHardy
Date: Wednesday, March 25, 2009 - 11:05 am

Don't we need to make sure the entry is not reused while dumping

This should be "1" I think since it wants a hlist_nulls hash.
--

From: Patrick McHardy
Date: Wednesday, March 25, 2009 - 11:06 am

OK I just realized my mistake, please ignore :)
--

From: Eric Dumazet
Date: Wednesday, March 25, 2009 - 11:15 am

Ah yes, I forgot that for UDP/TCP I had to change locking on this part.
Because messing with reference count was crazy...
But in UDP/TCP we have different spinlock for each chain, so hold time
was small enough.

So I guess that with central conntrack lock, we need to take references on entries
while dumping them.

--

From: Patrick McHardy
Date: Wednesday, March 25, 2009 - 11:24 am

Yes, I think so too.
--

From: Eric Dumazet
Date: Wednesday, March 25, 2009 - 11:53 am

Here is take 2 of the patch with proper ref counting on dumping.

Thank you

[PATCH] conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()

Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP.

This permits an easy conversion from call_rcu() based hash lists to a
SLAB_DESTROY_BY_RCU one.

Avoiding call_rcu() delay at nf_conn freeing time has numerous gains.

First, it doesnt fill RCU queues (up to 10000 elements per cpu).
This reduces OOM possibility, if queued elements are not taken into account
This reduces latency problems when RCU queue size hits hilimit and triggers
emergency mode.

- It allows fast reuse of just freed elements, permitting better use of
CPU cache.

- We delete rcu_head from "struct nf_conn", shrinking size of this structure
by 8 or 16 bytes.

This patch only takes care of "struct nf_conn".
call_rcu() is still used for less critical conntrack parts, that may
be converted later if necessary.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/netfilter/nf_conntrack.h                  |   14 -
 include/net/netfilter/nf_conntrack_tuple.h            |    6
 include/net/netns/conntrack.h                         |    5
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c |   57 ++--
 net/ipv4/netfilter/nf_nat_core.c                      |    2
 net/netfilter/nf_conntrack_core.c                     |  123 +++++-----
 net/netfilter/nf_conntrack_expect.c                   |    2
 net/netfilter/nf_conntrack_helper.c                   |    7
 net/netfilter/nf_conntrack_netlink.c                  |   20 -
 net/netfilter/nf_conntrack_standalone.c               |   51 ++--
 net/netfilter/xt_connlimit.c                          |    6
 11 files changed, 166 insertions(+), 127 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 4dfb793..6c3f964 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -91,8 ...
From: Patrick McHardy
Date: Wednesday, March 25, 2009 - 12:00 pm

Can we assume the next pointer still points to the next entry
in the same chain after the refcount dropped to zero?


--

From: Eric Dumazet
Date: Wednesday, March 25, 2009 - 12:17 pm

We are looking chain N.
If we cannot atomic_inc() refcount, we got some deleted entry.
If we could atomic_inc, we can meet an entry that just moved to another chain X

When hitting its end, we continue the search to the N+1 chain so we only 
skip the end of previous chain (N). We can 'forget' some entries, we can print
several time one given entry.


We could solve this by :

1) Checking hash value : if not one expected -> 
   Going back to head of chain N, (potentially re-printing already handled entries)
   So it is not a *perfect* solution.

2) Use a locking to forbid writers (as done in UDP/TCP), but it is expensive and
wont solve other problem :

We wont avoid emitting same entry several time anyway (this is a flaw of 
current seq_file handling, since we 'count' entries to be skiped, and this is
wrong if some entries were deleted or inserted meanwhile)

We have same problem on /proc/net/udp & /proc/net/tcp, I am not sure we should care...

Also, current resizing code can give to a /proc/net/ip_conntrack reader a problem, since
hash table can switch while its doing its dumping : many entries might be lost or regiven...


--

From: Patrick McHardy
Date: Wednesday, March 25, 2009 - 12:41 pm

I think double entries are not a problem, as you say, there
are already other cases where this can happen. But I think we
should try our best that every entry present at the start and
still present at the end of a dump is also contained in the
dump, otherwise the guantees seem to weak to still be useful.

Thats true. But its a very rare operation, so I think its mainly
a documentation issue.
--

From: Eric Dumazet
Date: Wednesday, March 25, 2009 - 12:58 pm

If your concern is to not forget entries, and we are allowed to print some entries several times,
then we can just check the final "nulls" value, and if we find a different value than expected for
chain N, go back to begining of chain N.

No need to check hash value (this could help not print several time same entry, we dont care that much)

+       while (is_a_nulls(head)) {
+               if (likely(get_nulls_value(head) == st->bucket)) {
+                       if (++st->bucket >= nf_conntrack_htable_size)
+                               return NULL;
+               }

Thank you

[PATCH] conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()

Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP.

This permits an easy conversion from call_rcu() based hash lists to a
SLAB_DESTROY_BY_RCU one.

Avoiding call_rcu() delay at nf_conn freeing time has numerous gains.

First, it doesnt fill RCU queues (up to 10000 elements per cpu).
This reduces OOM possibility, if queued elements are not taken into account
This reduces latency problems when RCU queue size hits hilimit and triggers
emergency mode.

- It allows fast reuse of just freed elements, permitting better use of
CPU cache.

- We delete rcu_head from "struct nf_conn", shrinking size of this structure
by 8 or 16 bytes.

This patch only takes care of "struct nf_conn".
call_rcu() is still used for less critical conntrack parts, that may
be converted later if necessary.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/netfilter/nf_conntrack.h                  |   14 -
 include/net/netfilter/nf_conntrack_tuple.h            |    6
 include/net/netns/conntrack.h                         |    5
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c |   63 ++---
 net/ipv4/netfilter/nf_nat_core.c                      |    2
 net/netfilter/nf_conntrack_core.c                     |  123 +++++-----
 net/netfilter/nf_conntrack_expect.c                   |    2
 ...
From: Patrick McHardy
Date: Wednesday, March 25, 2009 - 1:10 pm

Applied, thanks a lot.
--

From: Joakim Tjernlund
Date: Tuesday, March 24, 2009 - 11:29 am

Just add "rcuclassic.qhimark=2048" to your cmdline.

 Jocke

--

Previous thread: [PATCH] ucc_geth: Convert to net_device_ops by Joakim Tjernlund on Monday, March 23, 2009 - 3:17 am. (11 messages)

Next thread: [PATCH] be2net: cleanup rx/tx rate calculations by Sathya Perla on Monday, March 23, 2009 - 4:51 am. (2 messages)