At an unnamed ISP, we experienced a DDoS attack against one of our customers. This attack also caused problems for one of our Linux based routers. The attack was "only" generating 300 kpps (packets per sec), which usually isn't a problem for this (fairly old) Linux Router. But the conntracking system chocked and reduced pps processing power to 40kpps. I do extensive RRD/graph monitoring of the machines. The IP conntrack searches in the period exploded, to a stunning 700.000 searches per sec. http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png First I though it might be caused by bad hashing, but after reading the kernel code (func: __nf_conntrack_find()), I think its caused by the loop restart (goto begin) of the conntrack search, running under local_bh_disable(). These RCU changes to conntrack were introduced in ea781f19 by Eric Dumazet. Code: net/netfilter/nf_conntrack_core.c Func: __nf_conntrack_find() struct nf_conntrack_tuple_hash * __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) { struct nf_conntrack_tuple_hash *h; struct hlist_nulls_node *n; unsigned int hash = hash_conntrack(tuple); /* Disable BHs the entire time since we normally need to disable them * at least once for the stats anyway. */ local_bh_disable(); begin: hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { if (nf_ct_tuple_equal(tuple, &h->tuple)) { NF_CT_STAT_INC(net, found); local_bh_enable(); return h; } NF_CT_STAT_INC(net, searched); } /* * if the nulls value we got at the end of this lookup is * not the expected one, we must restart lookup. * We probably met an item that was moved to another chain. */ if (get_nulls_value(n) != hash) goto begin; local_bh_enable(); return NULL; http://people.netfilter.org/hawk/DDoS/2010-04-12__001/list.html Its possible to see, that the problems are most likely caused by the number of conntrack elements being ...
We should add a retry limit there. -- Regards, Changli Gao(xiaosuo@gmail.com) --
We can't do that since that would allow false negatives. --
If one hash slot is under attack, then there is a bug somewhere. If we cannot avoid this, we can fallback to a secure mode at the second retry, and take the spinlock. Tis way, most of lookups stay lockless (one pass), and some might take the slot lock to avoid the possibility of a loop. I suspect a bug elsewhere, quite frankly ! We have a chain that have an end pointer that doesnt match the expected one. --
On normal situation, we always finish the lookup : 1) If we found the thing we were looking at. 2) We get the list end (item not found), we then check if it is the expected end. It is _not_ the expected end only if some writer deleted/inserted an element in _this_ chain during our lookup. Because our lookup is lockless, we then have to redo it because we might miss the object we are looking for. If we can do the 'retry' a 10 times, it means the attacker was really clever enough to inject new packets (new conntracks) at the right moment, in the right hash chain, and this sounds so higly incredible that I cannot believe it at all :) --
So this situation uses SLAB_DESTROY_BY_RCU to quickly recycle deleted elements? (Not obvious from the code, but my ignorance of the networking code is such that many things in that part of the kernel are not obvious to me, I am afraid.) Otherwise, of course you would simply allow deleted elements to continue pointing where they did previously, so that concurrent readers would not miss anything. Of course, the same potential might arise on insertion, but it is usually Ah... Is there also a resize operation? Herbert did do a resizable hash table recently, but I was under the impression that (1) it was in some other part of the networking stack and (2) it avoided the need to Or maybe the DoS attack is injecting so many new conntracks that a large fraction of the hash chains are being modified at any given time? Thanx, Paul --
OK, that will do it!!! ;-) One way of throttling the bad effects of updates on readers is to periodically force updates through a grace period. But this seems to be a very big hammer, and likely to have little practical effect. Another approach would be to have multiple list pointers per element, so that a given element could be reused a small number of times without messing up concurrent readers (sort of like Herbert's resizable hash table). But as you say, if some other bug is really behind this, better to fix ;-) ;-) ;-) Thanx, Paul --
I think its plausable, there is a lot of modification going on. Approx 40.000 deletes/sec and 40.000 inserts/sec. The hash bucket size is 300032, and with 80000 modifications/sec, we are (potentially) changing 26.6% of the hash chains each second. As can be seen from the graphs: http://people.netfilter.org/hawk/DDoS/2010-04-12__001/list.html Notice that primarily CPU2 is doing the 40k deletes/sec, while CPU1 is Guess I have to reproduce the DoS attack in a testlab (I will first have time Tuesday). So we can determine if its bad hashing or restart of the search loop. The traffic pattern was fairly simple: 200 bytes UDP packets, comming from approx 60 source IPs, going to one destination IP. The UDP destination port number was varied in the range of 1 to 6000. The source UDP port was varied a bit more, some ranging from 32768 to 61000, and some from 1028 to 5000. Cheers, Jesper Brouer -- ------------------------------------------------------------------- MSc. Master of Computer Science Dept. of Computer Science, University of Copenhagen Author of http://www.adsl-optimizer.dk -------------------------------------------------------------------
OK but a lookup last a fraction of a micro second, unless interrupted by hard irq. Probability of a change during a lookup should be very very small. Note that the scenario for a restart is : The lookup go through the chain. While it is examining one object, this object is deleted. The object is re-allocated by another cpu and inserted to a new chain. --
Or very long chains, if attacker managed to find a jhash flaw.
You could add a lookup_restart counter :
include/linux/netfilter/nf_conntrack_common.h | 1 +
net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c | 7 ++++---
net/netfilter/nf_conntrack_core.c | 4 +++-
net/netfilter/nf_conntrack_standalone.c | 7 ++++---
4 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index c608677..cf9a8df 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -113,6 +113,7 @@ struct ip_conntrack_stat {
unsigned int expect_new;
unsigned int expect_create;
unsigned int expect_delete;
+ unsigned int lookup_restart;
};
/* call to create an explicit dependency on nf_conntrack. */
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index 2fb7b76..95f2227 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -336,12 +336,12 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
const struct ip_conntrack_stat *st = v;
if (v == SEQ_START_TOKEN) {
- seq_printf(seq, "entries searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error expect_new expect_create expect_delete\n");
+ seq_printf(seq, "entries searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error expect_new expect_create expect_delete lookup_restart\n");
return 0;
}
seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x "
- "%08x %08x %08x %08x %08x %08x %08x %08x \n",
+ "%08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
nr_conntracks,
st->searched,
st->found,
@@ -358,7 +358,8 @@ static int ct_cpu_seq_show(struct ...Eric, I wonder if we run into some kind of issue on 32-bit systems because we always lose a bit of the conntrack hash value when we store it into the 'nulls' area? Wouldn't that make the "get_nulls_value(n) != hash" fail? --
Well, 'hash' at this time is not the result of the jhash() transform [0
- 0xFFFFFFFF], but a slot number in htable [0 - (300032-1)].
And we can have a nulls_value up to 0x7FFFFFFF (31 bits)
static inline unsigned long get_nulls_value(const struct hlist_nulls_node *ptr)
{
return ((unsigned long)ptr) >> 1;
}
--
From: Eric Dumazet <eric.dumazet@gmail.com> Aha, I see. I really can't see what might cause this behavior then. --
From: David Miller <davem@davemloft.net>
This all reminds me of the namespace bug we dealt with
a month or two ago.
Jesper, you don't happen to be using network namespaces are you?
Because if so, the following might be your cure.
commit 5b3501faa8741d50617ce4191c20061c6ef36cb3
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon Feb 8 11:16:56 2010 -0800
netfilter: nf_conntrack: per netns nf_conntrack_cachep
--
No, I don't use network namespaces. (In .config CONFIG_NAMESPACES is not set.) Cheers, Jesper Brouer -- ------------------------------------------------------------------- MSc. Master of Computer Science Dept. of Computer Science, University of Copenhagen Author of http://www.adsl-optimizer.dk ------------------------------------------------------------------- --
I've applied Jespers equivalent patch. --
Yes of course, I missed it or I would not have cooked it ;) Thanks --
2.6.31.7-pvlan2G #3 SMP PREEMPT 32-bit kernel with 2G kernel mem (you showed me that trick). I have some patches on top (which are accepted upstream), but I originate from commit f8ebcb2ebc49a Linux 2.6.31.7. Cheers, Jesper Brouer -- ------------------------------------------------------------------- MSc. Master of Computer Science Dept. of Computer Science, University of Copenhagen Author of http://www.adsl-optimizer.dk ------------------------------------------------------------------- --
Since when is enabling 2G a trick? :) There's CONFIG_VMSPLIT_2G (and 2G_OPT) for quite some time now. --
Yes, when you know it, its not a trick anymore :) Years ago, we had to manually change PAGE_OFFSET, and I remember some machines with PAGE_OFFSET 0xA0000000 (1.5 GB LOWMEM), or 0xB0000000 (1.25 GB), (PAE off) --
I notice that 0xB0000000, which is now known as LOWMEM_3G_OPT, is only available when PAE is off. Would you know the reason for that decision? Are some values unsuitable for PAE? --
If PAE was on, PAGE_OFFSET must be a 1GB multiple. This is because of hardware limitations. --
I think another scenario that seems a bit more likely would be that a new entry is added to the chain after it was fully searched. Perhaps we could continue searching at the last position if the last entry is not a nulls entry to improve this. --
But the last entry is always a nulls entry, what do you mean exactly ? When an unsert (of a fresh object, not a reused one) is done, this doesnt affect lookups in any way, since its done at the head of list. --
Re-reading this, I am not sure there is a real problem on RCU as you pointed out. With 800.000 entries, in a 300.032 buckets hash table, each lookup hit about 3 entries (aka searches in conntrack stats) 300.000 packets/second -> 900.000 'searches' per second. If you have four cpus all trying to insert/delete entries in //, they all hit the central conntrack lock. On a DDOS scenario, every packet needs to take this lock twice, once to free an old conntrack (early drop), once to insert a new entry. To scale this, only way would be to have an array of locks, like we have for TCP/UDP hash tables. I did some tests here, with a multiqueue card, flooded with 300.000 pack/second, 65.536 source IP, millions of flows, and nothing wrong happened (but packets drops, of course) My two cpus were busy 100%, after tweaking smp_affinities, because on first try, irqbalance put "01" mask on both queues, so only one ksoftirq was working, other cpu was idle :( --
The machine is not getting 300.000 pps, its only getting 40.000 pps (the rest is stopped by the NIC by sending Ethernet flowcontrol pause frames) http://people.netfilter.org/hawk/DDoS/2010-04-12__001/eth0-rx.png We are doing 700.000 'searches' per second, with 40.000 pps, thus on average the list lenght (in each hash bucket) just need to be 17.5 elements. Is this an acceptable has distribution, with 900.000 elements This machine only have two CPUs, or rather one physical CPU and one hyperthreaded. The server is a old HP380 G4, with an old Xeon type CPU 3.4 GHz (1MB cache). If remember correctly its based on Pentium-4 I was worried about if the "early drop" e.g. free an old conntrack would A small hint when testing, use Haralds tool 'lnstat' to see the stats on the command line, thus you don't need to RRDtool graphe every thing: Command: lnstat -f nf_conntrack -i 1 -c 1000 I don't have a multiqueue NIC in this old machine. I also ran some tests on my 10G testlab, but it didn't go wrong. Tweeking the pktgen DDoS I could get the system to do 4.500.000 'searches' per sec with a 1.500.000 packets/sec. (Have not reloaded the kernel with the new failed lookup stats). Guess my 10G machines are too fast to hit Think my machine is some what slower than yours, perhaps its simply not fast enough for this kind of workload (pretty sure that the cache is the CPU is getting f*ked in this case). On my machine one CPU in stuck in softirq: http://people.netfilter.org/hawk/DDoS/2010-04-12__001/cpu_softirq001.png And another observation is that the CPUs are disturbing each other on the RX softirq code path. http://people.netfilter.org/hawk/DDoS/2010-04-12__001/softnet_time_squeeze_rx-softirq0... (/proc/net/softnet_stat column 3) Monday or Tuesdag I'll do a test setup with some old HP380 G4 machines to see if I can reproduce the DDoS attack senario. And see if I can get it into to lookup loop. Hilsen Jesper ...
Theorically a loop is very unlikely, given a single retry is very unlikly too. Unless a cpu gets in its cache a corrupted value of a 'next' pointer. Maybe a hardware problem ? My test machine is a fairly low end one, an AMD Athlon Dual core 5050e, 2.6 GHz I used an igb card for ease of setup, and to make sure my two cores would handle packets in parallel, without RPS. With same hash bucket size (300.032) and max conntracks (800.000), and after more than 10 hours of test, not a single lookup was restarted because of a nulls with wrong value. I can setup a test on a 16 cpu machine, multiqueue card too. Hmm, I forgot to say I am using net-next-2.6, not your kernel version... --
So fare, I have to agree with you. I have now tested on the same type of hardware (although running a 64-bit kernel, and off net-next-2.6), and the result is, the same as yours, I don't see a any restarts of the loop. The test systems differs a bit, as it has two physical CPU and 2M cache (and annoyingly the system insists on using HPET as clocksource). Guess, the only explanation would be bad/sub-optimal hash distribution. With 40 kpps and 700.000 'searches' per second, the hash bucket list length "only" need to be 17.5 elements on average, where optimum is 3. With my pktgen DoS test, where I tried to reproduce the DoS attack, only Don't think that is necessary. My theory was it was possible on slower single queue NIC, where one CPU is 100% busy in the conntrack search, and the other CPUs delete the entries (due to early drop and I also did this test using net-next-2.6, perhaps I should try the version I use in production... -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network Kernel Developer Cand. Scient Datalog / MSc.CS Author of http://adsl-optimizer.dk LinkedIn: http://www.linkedin.com/in/brouer --
I had a look at current conntrack and found the 'unconfirmed' list was
maybe a candidate for a potential blackhole.
That is, if a reader happens to hit an entry that is moved from regular
hash table slot 'hash' to unconfirmed list, reader might scan whole
unconfirmed list to find out he is not anymore on the wanted hash chain.
Problem is this unconfirmed list might be very very long in case of
DDOS. It's really not designed to be scanned during a lookup.
So I guess we should stop early if we find an unconfirmed entry ?
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index bde095f..0573641 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -298,8 +298,10 @@ extern int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
extern unsigned int nf_conntrack_htable_size;
extern unsigned int nf_conntrack_max;
-#define NF_CT_STAT_INC(net, count) \
+#define NF_CT_STAT_INC(net, count) \
__this_cpu_inc((net)->ct.stat->count)
+#define NF_CT_STAT_ADD(net, count, value) \
+ __this_cpu_add((net)->ct.stat->count, value)
#define NF_CT_STAT_INC_ATOMIC(net, count) \
do { \
local_bh_disable(); \
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index eeeb8bc..e96d999 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -299,6 +299,7 @@ __nf_conntrack_find(struct net *net, u16 zone,
struct nf_conntrack_tuple_hash *h;
struct hlist_nulls_node *n;
unsigned int hash = hash_conntrack(net, zone, tuple);
+ unsigned int cnt = 0;
/* Disable BHs the entire time since we normally need to disable them
* at least once for the stats anyway.
@@ -309,10 +310,19 @@ begin:
if (nf_ct_tuple_equal(tuple, &h->tuple) &&
nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
NF_CT_STAT_INC(net, found);
+ NF_CT_STAT_ADD(net, searched, cnt);
local_bh_enable();
return h;
...Sorry, but I can't find where we do this things. unconfirmed list is used to track the unconfirmed cts, whose corresponding skbs are still in path from the first to the last netfilter hooks. As soon as the skbs end their travel in netfilter, the corresponding cts will be confirmed(moving ct from unconfirmed list to regular hash table). unconfirmed list should be small, as networking receiving is in BH. -- Regards, Changli Gao(xiaosuo@gmail.com) --
So netfilter is a monolithic thing. When a packet begins its travel into netfilter, you guarantee that no other packet can also begin its travel and find an unconfirmed conntrack ? So according to you, netfilter/ct runs only in input path ? So I assume a packet is handled by CPU X, creates a new conntrack (possibly early droping an old entry that was previously in a standard hash chain), inserted in unconfirmed list. _You_ guarantee another CPU Y, handling another packet, possibly sent by a hacker reading your I first implemented such a patch to reduce cache line contention, then I asked to myself : What is exactly an unconfirmed conntrack ? Can their number be unbounded ? If yes, we have a problem, even on a two cpus machine. Using two lists instead of one wont solve the fundamental problem. The real question is, why do we need this unconfirmed 'list' in the first place. Is it really a private per cpu thing ? Can you prove this, in respect of lockless lookups, and things like NFQUEUE ? Each conntrack object has two list anchors. One for IP_CT_DIR_ORIGINAL, one for IP_CT_DIR_REPLY. Unconfirmed list use the first anchor. This means another cpu can definitely find an unconfirmed item in a regular hash chain, since we dont respect an RCU grace period before re-using an object. If memory was not a problem, we probably would use a third anchor to avoid this, or regular RCU instead of SLAB_DESTROY_BY_RCU variant. --
No. there are another entrances: local out and nf_reinject. If there isn't any packet queued, as netfilter is in atomic context, the nubmer No, it isn't really private. But most of time, it is accessed locally, thanks for your explaining. -- Regards, Changli Gao(xiaosuo@gmail.com) --
If a new conntrack is created in PRE_ROUTING or LOCAL_OUT, it will be added to the unconfirmed list and moved to the hash as soon as the packet passes POST_ROUTING. This means the number of unconfirmed entries created by the network is bound by the number of CPUs due to BH processing. The number created by locally generated packets is unbound Its used for cleaning up conntracks not in the hash table yet on module unload (or manual flush). It is supposed to be write-only So I guess we should check the CONFIRMED bit when searching in the hash table. --
OK, we should have a percpu list then. BTW, I notice nf_conntrack_untracked is incorrectly annotated '__read_mostly'. It can be written very often :( Should'nt we special case it and let be really const ? --
That would need quite a bit of special-casing to avoid touching the reference counts. So far this is completely hidden, so I'd say it just shouldn't be marked __read_mostly. Alternatively we can make "untracked" a nfctinfo state. --
I tried this suggestion, (a new IP_CT_UNTRACKED ctinfo), over a per_cpu
untracked ct, but its a bit hard.
For example, I cannot find a way to change ctnetlink_conntrack_event() :
if (ct == &nf_conntrack_untracked)
return 0;
Maybe this piece of code is not necessary, we should not come here
anyway, or it means several packets could store events for this (shared)
ct ?
Obviously, an IPS_UNTRACKED bit would be much easier to implement.
Would it be acceptable ?
Preliminary patch with IP_CT_UNTRACKED, probably not working at all...
include/linux/netfilter/nf_conntrack_common.h | 3 +
include/net/netfilter/nf_conntrack.h | 11 +++--
include/net/netfilter/nf_conntrack_core.h | 2
net/ipv4/netfilter/nf_nat_core.c | 4 +
net/ipv4/netfilter/nf_nat_standalone.c | 2
net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c | 4 -
net/netfilter/nf_conntrack_core.c | 32 +++++++++------
net/netfilter/nf_conntrack_netlink.c | 6 +-
net/netfilter/xt_CT.c | 13 +++---
net/netfilter/xt_NOTRACK.c | 4 -
net/netfilter/xt_TEE.c | 8 +--
net/netfilter/xt_cluster.c | 2
net/netfilter/xt_conntrack.c | 2
net/netfilter/xt_socket.c | 2
14 files changed, 58 insertions(+), 37 deletions(-)
diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index 14e6d32..5f7c947 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -15,6 +15,9 @@ enum ip_conntrack_info {
IP_CT_DIR_ORIGINAL); may be a retransmission. */
IP_CT_NEW,
+ /* Untracked */
+ IP_CT_UNTRACKED,
+
/* >= this indicates reply direction */
IP_CT_IS_REPLY,
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index bde095f..884ade9 ...We probably shouldn't be reaching that code since that would mean that we previously did modifications to the untracked conntrack. That also would be fine. However the main idea behind using a nfctinfo bit was that we wouldn't need the untracked conntrack anymore at all. But I guess a per-cpu untrack conntrack would already be an improvement over the current situation. --
I think Eric didn't mean ip_conntrack_info but ip_conntrack_status bit. Since we have had a IPS_TEMPLATE bit, I think another IPS_UNTRACKED bit is also acceptable. -- Regards, Changli Gao(xiaosuo@gmail.com) --
Yes, of course. But using one of these bits implies that we'd still have the untracked conntrack. --
Yes, it was my idea, with a per_cpu untracked conntrack. I'll submit a patch, thanks. --
Here is first part, introducing IPS_UNTRACKED bit and various helpers to abstract nf_conntrack_untracked access. I'll cook second patch in a couple of hours for per_cpu conversion. Thanks ! [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit NOTRACK makes all cpus share a cache line on nf_conntrack_untracked twice per packet. This is bad for performance. __read_mostly annotation is also a bad choice. This patch introduces IPS_UNTRACKED bit so that we can use later a per_cpu untrack structure more easily. A new helper, nf_ct_untracked_get() returns a pointer to nf_conntrack_untracked. Another one, nf_ct_untracked_status_or() is used by nf_nat_init() to add IPS_NAT_DONE_MASK bits to untracked status. nf_ct_is_untracked() prototype is changed to work on a nf_conn pointer. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- include/linux/netfilter/nf_conntrack_common.h | 4 ++++ include/net/netfilter/nf_conntrack.h | 12 +++++++++--- include/net/netfilter/nf_conntrack_core.h | 2 +- net/ipv4/netfilter/nf_nat_core.c | 2 +- net/ipv4/netfilter/nf_nat_standalone.c | 2 +- net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c | 2 +- net/netfilter/nf_conntrack_core.c | 11 ++++++++--- net/netfilter/nf_conntrack_netlink.c | 2 +- net/netfilter/xt_CT.c | 4 ++-- net/netfilter/xt_NOTRACK.c | 2 +- net/netfilter/xt_TEE.c | 4 ++-- net/netfilter/xt_cluster.c | 2 +- net/netfilter/xt_conntrack.c | 11 ++++++----- net/netfilter/xt_socket.c | 2 +- net/netfilter/xt_state.c | 14 ++++++++------ 15 files changed, 47 insertions(+), 29 deletions(-) diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h index 14e6d32..1afd18c 100644 --- ...
NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet, slowing down performance.
This patch converts it to a per_cpu variable.
We assume same cpu is used for a given packet, entering and exiting the
NOTRACK state.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/netfilter/nf_conntrack.h | 5 +--
net/netfilter/nf_conntrack_core.c | 36 ++++++++++++++++++-------
2 files changed, 28 insertions(+), 13 deletions(-)
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 3bc38c7..84a4b6f 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -261,11 +261,10 @@ extern s16 (*nf_ct_nat_offset)(const struct nf_conn *ct,
u32 seq);
/* Fake conntrack entry for untracked connections */
+DECLARE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
static inline struct nf_conn *nf_ct_untracked_get(void)
{
- extern struct nf_conn nf_conntrack_untracked;
-
- return &nf_conntrack_untracked;
+ return &__raw_get_cpu_var(nf_conntrack_untracked);
}
extern void nf_ct_untracked_status_or(unsigned long bits);
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 6c1da21..9c66141 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -62,8 +62,8 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
unsigned int nf_conntrack_max __read_mostly;
EXPORT_SYMBOL_GPL(nf_conntrack_max);
-struct nf_conn nf_conntrack_untracked;
-EXPORT_SYMBOL_GPL(nf_conntrack_untracked);
+DEFINE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
+EXPORT_PER_CPU_SYMBOL(nf_conntrack_untracked);
static int nf_conntrack_hash_rnd_initted;
static unsigned int nf_conntrack_hash_rnd;
@@ -1183,10 +1183,21 @@ static void nf_ct_release_dying_list(struct net *net)
spin_unlock_bh(&nf_conntrack_lock);
}
+static int untrack_refs(void)
+{
+ int cnt = 0, cpu;
+
+ for_each_possible_cpu(cpu) ...That doesn't seem to be a valid assumption, the conntrack entry is attached to the skb and processing in the output path might get preempted and rescheduled to a different CPU. --
Thats unfortunate. Ok, only choice then is to not change refcount on the untracked ct, and keep a shared (read only after setup time) untrack structure. --
Oh well, re-reading my patch, I dont see why I said this in Changelog :)
We lazily select the untrack structure in one cpu, then keep the pointer
to this untrack structure, attached to ct.
The (still atomic) increment / decrement of refcount is done on the
saved pointer, not on actual per_cpu structure.
So if a packet is rescheduled on a different CPU, second cpu will "only"
dirty cache line of other cpu, it probably almost never happens...
Thanks
[PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking
NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet, slowing down performance.
This patch converts it to a per_cpu variable.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/netfilter/nf_conntrack.h | 5 +--
net/netfilter/nf_conntrack_core.c | 36 ++++++++++++++++++-------
2 files changed, 28 insertions(+), 13 deletions(-)
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 3bc38c7..84a4b6f 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -261,11 +261,10 @@ extern s16 (*nf_ct_nat_offset)(const struct nf_conn *ct,
u32 seq);
/* Fake conntrack entry for untracked connections */
+DECLARE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
static inline struct nf_conn *nf_ct_untracked_get(void)
{
- extern struct nf_conn nf_conntrack_untracked;
-
- return &nf_conntrack_untracked;
+ return &__raw_get_cpu_var(nf_conntrack_untracked);
}
extern void nf_ct_untracked_status_or(unsigned long bits);
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 6c1da21..9c66141 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -62,8 +62,8 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
unsigned int nf_conntrack_max __read_mostly;
EXPORT_SYMBOL_GPL(nf_conntrack_max);
-struct nf_conn ...That sounds like a good idea. But lets what for Jesper's test results before we start fixing this problem :) --
I will first have time to perform the tests Monday or Tuesday. BUT I have just noticed there seems to be a corrolation between conntrack early_drop and searches. I have upload a new graph: http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_early_drop002.png I have not had time to checkout the code path yet... Cheers, Jesper Brouer -- ------------------------------------------------------------------- MSc. Master of Computer Science Dept. of Computer Science, University of Copenhagen Author of http://www.adsl-optimizer.dk ------------------------------------------------------------------- --
I guess that's somewhat expected when your conntrack table is full and all you're seeing is new connection setup attempts. First you have a search for an existing conntrack, then it attempts to create a new one and tries to early_drop and old one. --
I have added a stats counter to prove my case, which I think we should add to the kernel (to detect the case in the future).
The DDoS attack has disappeared, so I guess I'll try to see if I can reproduce the problem in my testlab.
[PATCH] net: netfilter conntrack extended with extra stat counter.
From: Jesper Dangaard Brouer <hawk@comx.dk>
I suspect an unfortunatly series of events occuring under a DDoS
attack, in function __nf_conntrack_find() nf_contrack_core.c.
Adding a stats counter to see if the search is restarted too often.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
---
include/linux/netfilter/nf_conntrack_common.h | 1 +
.../netfilter/nf_conntrack_l3proto_ipv4_compat.c | 7 ++++---
net/netfilter/nf_conntrack_core.c | 4 +++-
net/netfilter/nf_conntrack_standalone.c | 7 ++++---
4 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index a8248ee..57ff312 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -93,6 +93,7 @@ struct ip_conntrack_stat
unsigned int expect_new;
unsigned int expect_create;
unsigned int expect_delete;
+ unsigned int search_restart;
};
/* call to create an explicit dependency on nf_conntrack. */
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index 8668a3d..b94f510 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -336,12 +336,12 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
const struct ip_conntrack_stat *st = v;
if (v == SEQ_START_TOKEN) {
- seq_printf(seq, "entries searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error expect_new expect_create ...