Re: DDoS attack causing bad effect on conntrack searches

Previous thread: [PATCH NEXT 0/8]qlcnic: inter driver coexistence fixes by Amit Kumar Salecha on Thursday, April 22, 2010 - 5:51 am. (8 messages)

Next thread: [PATCHv2] Socket filter ancilliary data access for skb->dev->type by Paul LeoNerd Evans on Thursday, April 22, 2010 - 6:32 am. (2 messages)
From: Jesper Dangaard Brouer
Date: Thursday, April 22, 2010 - 5:58 am

At an unnamed ISP, we experienced a DDoS attack against one of our
customers.  This attack also caused problems for one of our Linux
based routers.

The attack was "only" generating 300 kpps (packets per sec), which
usually isn't a problem for this (fairly old) Linux Router.  But the
conntracking system chocked and reduced pps processing power to
40kpps.

I do extensive RRD/graph monitoring of the machines.  The IP conntrack
searches in the period exploded, to a stunning 700.000 searches per
sec.

http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png

First I though it might be caused by bad hashing, but after reading
the kernel code (func: __nf_conntrack_find()), I think its caused by
the loop restart (goto begin) of the conntrack search, running under
local_bh_disable().  These RCU changes to conntrack were introduced in
ea781f19 by Eric Dumazet.

Code: net/netfilter/nf_conntrack_core.c
Func: __nf_conntrack_find()

struct nf_conntrack_tuple_hash *
__nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
{
	struct nf_conntrack_tuple_hash *h;
	struct hlist_nulls_node *n;
	unsigned int hash = hash_conntrack(tuple);

	/* Disable BHs the entire time since we normally need to disable them
	 * at least once for the stats anyway.
	 */
	local_bh_disable();
begin:
	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
		if (nf_ct_tuple_equal(tuple, &h->tuple)) {
			NF_CT_STAT_INC(net, found);
			local_bh_enable();
			return h;
		}
		NF_CT_STAT_INC(net, searched);
	}
	/*
	 * if the nulls value we got at the end of this lookup is
	 * not the expected one, we must restart lookup.
	 * We probably met an item that was moved to another chain.
	 */
	if (get_nulls_value(n) != hash)
		goto begin;
	local_bh_enable();

	return NULL;
 http://people.netfilter.org/hawk/DDoS/2010-04-12__001/list.html

Its possible to see, that the problems are most likely caused by the
number of conntrack elements being ...
From: Changli Gao
Date: Thursday, April 22, 2010 - 6:13 am

We should add a retry limit there.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: Patrick McHardy
Date: Thursday, April 22, 2010 - 6:17 am

We can't do that since that would allow false negatives.
--

From: Eric Dumazet
Date: Thursday, April 22, 2010 - 7:36 am

If one hash slot is under attack, then there is a bug somewhere.

If we cannot avoid this, we can fallback to a secure mode at the second
retry, and take the spinlock.

Tis way, most of lookups stay lockless (one pass), and some might take
the slot lock to avoid the possibility of a loop.

I suspect a bug elsewhere, quite frankly !

We have a chain that have an end pointer that doesnt match the expected
one.




--

From: Eric Dumazet
Date: Thursday, April 22, 2010 - 7:53 am

On normal situation, we always finish the lookup :

1) If we found the thing we were looking at.

2) We get the list end (item not found), we then check if it is the
expected end.

It is _not_ the expected end only if some writer deleted/inserted an
element in _this_ chain during our lookup.

Because our lookup is lockless, we then have to redo it because we might
miss the object we are looking for.

If we can do the 'retry' a 10 times, it means the attacker was really
clever enough to inject new packets (new conntracks) at the right
moment, in the right hash chain, and this sounds so higly incredible
that I cannot believe it at all :)



--

From: Paul E. McKenney
Date: Thursday, April 22, 2010 - 8:51 am

So this situation uses SLAB_DESTROY_BY_RCU to quickly recycle deleted
elements?  (Not obvious from the code, but my ignorance of the networking
code is such that many things in that part of the kernel are not obvious
to me, I am afraid.)

Otherwise, of course you would simply allow deleted elements to continue
pointing where they did previously, so that concurrent readers would not
miss anything.

Of course, the same potential might arise on insertion, but it is usually

Ah...  Is there also a resize operation?  Herbert did do a resizable
hash table recently, but I was under the impression that (1) it was in
some other part of the networking stack and (2) it avoided the need to

Or maybe the DoS attack is injecting so many new conntracks that a large
fraction of the hash chains are being modified at any given time?

							Thanx, Paul
--

From: Eric Dumazet
Date: Thursday, April 22, 2010 - 9:02 am

maybe hash table has one slot :)


--

From: Paul E. McKenney
Date: Thursday, April 22, 2010 - 9:34 am

OK, that will do it!!!  ;-)

One way of throttling the bad effects of updates on readers is to
periodically force updates through a grace period.  But this seems
to be a very big hammer, and likely to have little practical effect.

Another approach would be to have multiple list pointers per element,
so that a given element could be reused a small number of times without
messing up concurrent readers (sort of like Herbert's resizable hash
table).

But as you say, if some other bug is really behind this, better to fix

;-) ;-) ;-)

							Thanx, Paul
--

From: Jesper Dangaard Brouer
Date: Thursday, April 22, 2010 - 1:38 pm

I think its plausable, there is a lot of modification going on.
Approx 40.000 deletes/sec and 40.000 inserts/sec.
The hash bucket size is 300032, and with 80000 modifications/sec, we are 
(potentially) changing 26.6% of the hash chains each second.

As can be seen from the graphs:
  http://people.netfilter.org/hawk/DDoS/2010-04-12__001/list.html

Notice that primarily CPU2 is doing the 40k deletes/sec, while CPU1 is 

Guess I have to reproduce the DoS attack in a testlab (I will first have 
time Tuesday).  So we can determine if its bad hashing or restart of the 
search loop.


The traffic pattern was fairly simple:

200 bytes UDP packets, comming from approx 60 source IPs, going to one 
destination IP.  The UDP destination port number was varied in the range 
of 1 to 6000.   The source UDP port was varied a bit more, some ranging 
from 32768 to 61000, and some from 1028 to 5000.


Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
From: Eric Dumazet
Date: Thursday, April 22, 2010 - 2:03 pm

OK but a lookup last a fraction of a micro second, unless interrupted by
hard irq.

Probability of a change during a lookup should be very very small.

Note that the scenario for a restart is :

The lookup go through the chain.
While it is examining one object, this object is deleted.
The object is re-allocated by another cpu and inserted to a new chain.




--

From: Eric Dumazet
Date: Thursday, April 22, 2010 - 2:14 pm

Or very long chains, if attacker managed to find a jhash flaw.

You could add a lookup_restart counter :


 include/linux/netfilter/nf_conntrack_common.h         |    1 +
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c |    7 ++++---
 net/netfilter/nf_conntrack_core.c                     |    4 +++-
 net/netfilter/nf_conntrack_standalone.c               |    7 ++++---
 4 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index c608677..cf9a8df 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -113,6 +113,7 @@ struct ip_conntrack_stat {
 	unsigned int expect_new;
 	unsigned int expect_create;
 	unsigned int expect_delete;
+	unsigned int lookup_restart;
 };
 
 /* call to create an explicit dependency on nf_conntrack. */
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index 2fb7b76..95f2227 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -336,12 +336,12 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 	const struct ip_conntrack_stat *st = v;
 
 	if (v == SEQ_START_TOKEN) {
-		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete\n");
+		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete lookup_restart\n");
 		return 0;
 	}
 
 	seq_printf(seq, "%08x  %08x %08x %08x %08x %08x %08x %08x "
-			"%08x %08x %08x %08x %08x  %08x %08x %08x \n",
+			"%08x %08x %08x %08x %08x  %08x %08x %08x %08x\n",
 		   nr_conntracks,
 		   st->searched,
 		   st->found,
@@ -358,7 +358,8 @@ static int ct_cpu_seq_show(struct ...
From: David Miller
Date: Thursday, April 22, 2010 - 4:44 pm

Eric, I wonder if we run into some kind of issue on 32-bit systems
because we always lose a bit of the conntrack hash value when we store
it into the 'nulls' area?

Wouldn't that make the "get_nulls_value(n) != hash" fail?
--

From: Eric Dumazet
Date: Thursday, April 22, 2010 - 10:44 pm

Well, 'hash' at this time is not the result of the jhash() transform [0
- 0xFFFFFFFF], but a slot number in htable [0 - (300032-1)].


And we can have a nulls_value up to 0x7FFFFFFF  (31 bits)

static inline unsigned long get_nulls_value(const struct hlist_nulls_node *ptr)
{
        return ((unsigned long)ptr) >> 1;
}




--

From: David Miller
Date: Friday, April 23, 2010 - 1:13 am

From: Eric Dumazet <eric.dumazet@gmail.com>

Aha, I see.

I really can't see what might cause this behavior then.
--

From: David Miller
Date: Friday, April 23, 2010 - 1:18 am

From: David Miller <davem@davemloft.net>

This all reminds me of the namespace bug we dealt with
a month or two ago.

Jesper, you don't happen to be using network namespaces are you?
Because if so, the following might be your cure.

commit 5b3501faa8741d50617ce4191c20061c6ef36cb3
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Mon Feb 8 11:16:56 2010 -0800

    netfilter: nf_conntrack: per netns nf_conntrack_cachep
--

From: Jesper Dangaard Brouer
Date: Friday, April 23, 2010 - 1:40 am

No, I don't use network namespaces.
(In .config CONFIG_NAMESPACES is not set.)

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
--

From: Patrick McHardy
Date: Friday, April 23, 2010 - 3:36 am

I've applied Jespers equivalent patch.
--

From: Eric Dumazet
Date: Friday, April 23, 2010 - 4:06 am

Yes of course, I missed it or I would not have cooked it ;)

Thanks


--

From: Jesper Dangaard Brouer
Date: Thursday, April 22, 2010 - 2:28 pm

2.6.31.7-pvlan2G #3 SMP PREEMPT
32-bit kernel with 2G kernel mem (you showed me that trick).

I have some patches on top (which are accepted upstream), but I originate 
from commit f8ebcb2ebc49a Linux 2.6.31.7.


Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
--

From: Jan Engelhardt
Date: Friday, April 23, 2010 - 12:23 am

Since when is enabling 2G a trick? :)
There's CONFIG_VMSPLIT_2G (and 2G_OPT) for quite some time now.

--

From: Eric Dumazet
Date: Friday, April 23, 2010 - 12:46 am

Yes, when you know it, its not a trick anymore :)

Years ago, we had to manually change PAGE_OFFSET, and I remember some
machines with PAGE_OFFSET 0xA0000000  (1.5 GB LOWMEM), 
or 0xB0000000 (1.25 GB), (PAE off)



--

From: Jan Engelhardt
Date: Friday, April 23, 2010 - 12:55 am

I notice that 0xB0000000, which is now known as LOWMEM_3G_OPT,
is only available when PAE is off. Would you know the reason for
that decision? Are some values unsuitable for PAE?


--

From: Eric Dumazet
Date: Friday, April 23, 2010 - 2:23 am

If PAE was on, PAGE_OFFSET must be a 1GB multiple.
This is because of hardware limitations.


--

From: Patrick McHardy
Date: Friday, April 23, 2010 - 3:55 am

I think another scenario that seems a bit more likely would be
that a new entry is added to the chain after it was fully searched.
Perhaps we could continue searching at the last position if the
last entry is not a nulls entry to improve this.
--

From: Eric Dumazet
Date: Friday, April 23, 2010 - 4:05 am

But the last entry is always a nulls entry, what do you mean exactly ?

When an unsert (of a fresh object, not a reused one) is done, this
doesnt affect lookups in any way, since its done at the head of list.



--

From: Patrick McHardy
Date: Friday, April 23, 2010 - 4:06 am

Right, I missed that :)
--

From: Eric Dumazet
Date: Friday, April 23, 2010 - 1:57 pm

Re-reading this, I am not sure there is a real problem on RCU as you
pointed out.

With 800.000 entries, in a 300.032 buckets hash table, each lookup hit
about 3 entries (aka searches in conntrack stats)

300.000 packets/second -> 900.000 'searches' per second.

If you have four cpus all trying to insert/delete entries in //, they
all hit the central conntrack lock.

On a DDOS scenario, every packet needs to take this lock twice,
once to free an old conntrack (early drop), once to insert a new entry.

To scale this, only way would be to have an array of locks, like we have
for TCP/UDP hash tables.

I did some tests here, with a multiqueue card, flooded with 300.000
pack/second, 65.536 source IP, millions of flows, and nothing wrong
happened (but packets drops, of course)

My two cpus were busy 100%, after tweaking smp_affinities, because on
first try, irqbalance put "01" mask on both queues, so only one ksoftirq
was working, other cpu was idle :(



--

From: Jesper Dangaard Brouer
Date: Saturday, April 24, 2010 - 4:11 am

The machine is not getting 300.000 pps, its only getting 40.000 pps (the 
rest is stopped by the NIC by sending Ethernet flowcontrol pause frames)

  http://people.netfilter.org/hawk/DDoS/2010-04-12__001/eth0-rx.png

We are doing 700.000 'searches' per second, with 40.000 pps, thus on 
average the list lenght (in each hash bucket) just need to be 17.5 
elements.  Is this an acceptable has distribution, with 900.000 elements 

This machine only have two CPUs, or rather one physical CPU and one 
hyperthreaded.  The server is a old HP380 G4, with an old Xeon type CPU 
3.4 GHz (1MB cache).  If remember correctly its based on Pentium-4 

I was worried about if the "early drop" e.g. free an old conntrack would 

A small hint when testing, use Haralds tool 'lnstat' to see the stats on 
the command line, thus you don't need to RRDtool graphe every thing:

Command:
  lnstat -f nf_conntrack -i 1 -c 1000

I don't have a multiqueue NIC in this old machine.

I also ran some tests on my 10G testlab, but it didn't go wrong. 
Tweeking the pktgen DDoS I could get the system to do 4.500.000 'searches' 
per sec with a 1.500.000 packets/sec. (Have not reloaded the kernel with 
the new failed lookup stats).  Guess my 10G machines are too fast to hit 

Think my machine is some what slower than yours, perhaps its simply not 
fast enough for this kind of workload (pretty sure that the cache is the 
CPU is getting f*ked in this case).

On my machine one CPU in stuck in softirq:
  http://people.netfilter.org/hawk/DDoS/2010-04-12__001/cpu_softirq001.png

And another observation is that the CPUs are disturbing each other on the 
RX softirq code path.
http://people.netfilter.org/hawk/DDoS/2010-04-12__001/softnet_time_squeeze_rx-softirq0...
(/proc/net/softnet_stat column 3)

Monday or Tuesdag I'll do a test setup with some old HP380 G4 machines to 
see if I can reproduce the DDoS attack senario.  And see if I can get 
it into to lookup loop.

Hilsen
   Jesper ...
From: Eric Dumazet
Date: Saturday, April 24, 2010 - 1:11 pm

Theorically a loop is very unlikely, given a single retry is very
unlikly too.

Unless a cpu gets in its cache a corrupted value of a 'next' pointer.

Maybe a hardware problem ?

My test machine is a fairly low end one, an AMD Athlon Dual core 5050e,
2.6 GHz

I used an igb card for ease of setup, and to make sure my two cores
would handle packets in parallel, without RPS.

With same hash bucket size (300.032) and max conntracks (800.000), and
after more than 10 hours of test, not a single lookup was restarted
because of a nulls with wrong value.

I can setup a test on a 16 cpu machine, multiqueue card too.

Hmm, I forgot to say I am using net-next-2.6, not your kernel version...



--

From: Jesper Dangaard Brouer
Date: Monday, April 26, 2010 - 7:36 am

So fare, I have to agree with you.  I have now tested on the same type
of hardware (although running a 64-bit kernel, and off net-next-2.6),
and the result is, the same as yours, I don't see a any restarts of the
loop.  The test systems differs a bit, as it has two physical CPU and 2M
cache (and annoyingly the system insists on using HPET as clocksource).

Guess, the only explanation would be bad/sub-optimal hash distribution.
With 40 kpps and 700.000 'searches' per second, the hash bucket list
length "only" need to be 17.5 elements on average, where optimum is 3.
With my pktgen DoS test, where I tried to reproduce the DoS attack, only

Don't think that is necessary.  My theory was it was possible on slower
single queue NIC, where one CPU is 100% busy in the conntrack search,
and the other CPUs delete the entries (due to early drop and

I also did this test using net-next-2.6, perhaps I should try the
version I use in production...


-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network Kernel Developer
  Cand. Scient Datalog / MSc.CS
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer

--

From: Eric Dumazet
Date: Monday, May 31, 2010 - 2:21 pm

I had a look at current conntrack and found the 'unconfirmed' list was
maybe a candidate for a potential blackhole.

That is, if a reader happens to hit an entry that is moved from regular
hash table slot 'hash' to unconfirmed list, reader might scan whole
unconfirmed list to find out he is not anymore on the wanted hash chain.

Problem is this unconfirmed list might be very very long in case of
DDOS. It's really not designed to be scanned during a lookup.

So I guess we should stop early if we find an unconfirmed entry ?



diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index bde095f..0573641 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -298,8 +298,10 @@ extern int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
 extern unsigned int nf_conntrack_htable_size;
 extern unsigned int nf_conntrack_max;
 
-#define NF_CT_STAT_INC(net, count)	\
+#define NF_CT_STAT_INC(net, count)		\
 	__this_cpu_inc((net)->ct.stat->count)
+#define NF_CT_STAT_ADD(net, count, value)	\
+	__this_cpu_add((net)->ct.stat->count, value)
 #define NF_CT_STAT_INC_ATOMIC(net, count)		\
 do {							\
 	local_bh_disable();				\
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index eeeb8bc..e96d999 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -299,6 +299,7 @@ __nf_conntrack_find(struct net *net, u16 zone,
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
 	unsigned int hash = hash_conntrack(net, zone, tuple);
+	unsigned int cnt = 0;
 
 	/* Disable BHs the entire time since we normally need to disable them
 	 * at least once for the stats anyway.
@@ -309,10 +310,19 @@ begin:
 		if (nf_ct_tuple_equal(tuple, &h->tuple) &&
 		    nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
 			NF_CT_STAT_INC(net, found);
+			NF_CT_STAT_ADD(net, searched, cnt);
 			local_bh_enable();
 			return h;
 ...
From: Changli Gao
Date: Monday, May 31, 2010 - 5:28 pm

Sorry, but I can't find where we do this things. unconfirmed list is
used to track the unconfirmed cts, whose corresponding skbs are still
in path from the first to the last netfilter hooks. As soon as the
skbs end their travel in netfilter, the corresponding cts will be
confirmed(moving ct from unconfirmed list to regular hash table).

unconfirmed list should be small, as networking receiving is in BH.



-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: Eric Dumazet
Date: Monday, May 31, 2010 - 10:05 pm

So netfilter is a monolithic thing.

When a packet begins its travel into netfilter, you guarantee that no
other packet can also begin its travel and find an unconfirmed
conntrack ?


So according to you, netfilter/ct runs only in input path ?

So I assume a packet is handled by CPU X, creates a new conntrack
(possibly early droping an old entry that was previously in a standard
hash chain), inserted in unconfirmed list. _You_ guarantee another CPU
Y, handling another packet, possibly sent by a hacker reading your

I first implemented such a patch to reduce cache line contention, then I
asked to myself : What is exactly an unconfirmed conntrack ? Can their
number be unbounded ? If yes, we have a problem, even on a two cpus
machine. Using two lists instead of one wont solve the fundamental
problem.

The real question is, why do we need this unconfirmed 'list' in the
first place. Is it really a private per cpu thing ? Can you prove this,
in respect of lockless lookups, and things like NFQUEUE ? 

Each conntrack object has two list anchors. One for IP_CT_DIR_ORIGINAL,
one for IP_CT_DIR_REPLY.

Unconfirmed list use the first anchor. This means another cpu can
definitely find an unconfirmed item in a regular hash chain, since we
dont respect an RCU grace period before re-using an object.

If memory was not a problem, we probably would use a third anchor to
avoid this, or regular RCU instead of SLAB_DESTROY_BY_RCU variant.



--

From: Changli Gao
Date: Monday, May 31, 2010 - 10:48 pm

No. there are another entrances: local out and nf_reinject. If there
isn't any packet queued, as netfilter is in atomic context, the nubmer


No, it isn't really private. But most of time, it is accessed locally,

thanks for your explaining.


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: Patrick McHardy
Date: Tuesday, June 1, 2010 - 3:18 am

If a new conntrack is created in PRE_ROUTING or LOCAL_OUT, it will be
added to the unconfirmed list and moved to the hash as soon as the
packet passes POST_ROUTING. This means the number of unconfirmed entries
created by the network is bound by the number of CPUs due to BH
processing. The number created by locally generated packets is unbound

Its used for cleaning up conntracks not in the hash table yet on
module unload (or manual flush). It is supposed to be write-only

So I guess we should check the CONFIRMED bit when searching
in the hash table.
--

From: Eric Dumazet
Date: Tuesday, June 1, 2010 - 3:31 am

OK, we should have a percpu list then.

BTW, I notice nf_conntrack_untracked is incorrectly annotated
'__read_mostly'.

It can be written very often :(

Should'nt we special case it and let be really const ?



--

From: Patrick McHardy
Date: Tuesday, June 1, 2010 - 3:41 am

That would need quite a bit of special-casing to avoid touching
the reference counts. So far this is completely hidden, so I'd
say it just shouldn't be marked __read_mostly. Alternatively we
can make "untracked" a nfctinfo state.
--

From: Eric Dumazet
Date: Tuesday, June 1, 2010 - 9:20 am

I tried this suggestion, (a new IP_CT_UNTRACKED ctinfo), over a per_cpu
untracked ct, but its a bit hard.

For example, I cannot find a way to change ctnetlink_conntrack_event() :

	if (ct == &nf_conntrack_untracked)
		return 0;

Maybe this piece of code is not necessary, we should not come here
anyway, or it means several packets could store events for this (shared)
ct ?

Obviously, an IPS_UNTRACKED bit would be much easier to implement.
Would it be acceptable ?


Preliminary patch with IP_CT_UNTRACKED, probably not working at all...

 include/linux/netfilter/nf_conntrack_common.h  |    3 +
 include/net/netfilter/nf_conntrack.h           |   11 +++--
 include/net/netfilter/nf_conntrack_core.h      |    2 
 net/ipv4/netfilter/nf_nat_core.c               |    4 +
 net/ipv4/netfilter/nf_nat_standalone.c         |    2 
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |    4 -
 net/netfilter/nf_conntrack_core.c              |   32 +++++++++------
 net/netfilter/nf_conntrack_netlink.c           |    6 +-
 net/netfilter/xt_CT.c                          |   13 +++---
 net/netfilter/xt_NOTRACK.c                     |    4 -
 net/netfilter/xt_TEE.c                         |    8 +--
 net/netfilter/xt_cluster.c                     |    2 
 net/netfilter/xt_conntrack.c                   |    2 
 net/netfilter/xt_socket.c                      |    2 
 14 files changed, 58 insertions(+), 37 deletions(-)

diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index 14e6d32..5f7c947 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -15,6 +15,9 @@ enum ip_conntrack_info {
            IP_CT_DIR_ORIGINAL); may be a retransmission. */
 	IP_CT_NEW,
 
+	/* Untracked */
+	IP_CT_UNTRACKED,
+
 	/* >= this indicates reply direction */
 	IP_CT_IS_REPLY,
 
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index bde095f..884ade9 ...
From: Patrick McHardy
Date: Friday, June 4, 2010 - 4:40 am

We probably shouldn't be reaching that code since that would mean
that we previously did modifications to the untracked conntrack.

That also would be fine. However the main idea behind using a nfctinfo
bit was that we wouldn't need the untracked conntrack anymore at all.
But I guess a per-cpu untrack conntrack would already be an improvement
over the current situation.
--

From: Changli Gao
Date: Friday, June 4, 2010 - 5:10 am

I think Eric didn't mean ip_conntrack_info but ip_conntrack_status
bit. Since we have had a IPS_TEMPLATE bit, I think another
IPS_UNTRACKED bit is also acceptable.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: Patrick McHardy
Date: Friday, June 4, 2010 - 5:29 am

Yes, of course. But using one of these bits implies that we'd still
have the untracked conntrack.
--

From: Eric Dumazet
Date: Friday, June 4, 2010 - 5:36 am

Yes, it was my idea, with a per_cpu untracked conntrack.

I'll submit a patch, thanks.



--

From: Eric Dumazet
Date: Friday, June 4, 2010 - 9:25 am

Here is first part, introducing IPS_UNTRACKED bit and various helpers to
abstract nf_conntrack_untracked access.

I'll cook second patch in a couple of hours for per_cpu conversion.

Thanks !

[PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit

NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet. This is bad for performance.
__read_mostly annotation is also a bad choice.

This patch introduces IPS_UNTRACKED bit so that we can use later a
per_cpu untrack structure more easily.

A new helper, nf_ct_untracked_get() returns a pointer to
nf_conntrack_untracked.

Another one, nf_ct_untracked_status_or() is used by nf_nat_init() to add
IPS_NAT_DONE_MASK bits to untracked status.

nf_ct_is_untracked() prototype is changed to work on a nf_conn pointer.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netfilter/nf_conntrack_common.h  |    4 ++++
 include/net/netfilter/nf_conntrack.h           |   12 +++++++++---
 include/net/netfilter/nf_conntrack_core.h      |    2 +-
 net/ipv4/netfilter/nf_nat_core.c               |    2 +-
 net/ipv4/netfilter/nf_nat_standalone.c         |    2 +-
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |    2 +-
 net/netfilter/nf_conntrack_core.c              |   11 ++++++++---
 net/netfilter/nf_conntrack_netlink.c           |    2 +-
 net/netfilter/xt_CT.c                          |    4 ++--
 net/netfilter/xt_NOTRACK.c                     |    2 +-
 net/netfilter/xt_TEE.c                         |    4 ++--
 net/netfilter/xt_cluster.c                     |    2 +-
 net/netfilter/xt_conntrack.c                   |   11 ++++++-----
 net/netfilter/xt_socket.c                      |    2 +-
 net/netfilter/xt_state.c                       |   14 ++++++++------
 15 files changed, 47 insertions(+), 29 deletions(-)

diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index 14e6d32..1afd18c 100644
--- ...
From: Eric Dumazet
Date: Friday, June 4, 2010 - 1:15 pm

NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet, slowing down performance.

This patch converts it to a per_cpu variable.

We assume same cpu is used for a given packet, entering and exiting the
NOTRACK state.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/netfilter/nf_conntrack.h |    5 +--
 net/netfilter/nf_conntrack_core.c    |   36 ++++++++++++++++++-------
 2 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 3bc38c7..84a4b6f 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -261,11 +261,10 @@ extern s16 (*nf_ct_nat_offset)(const struct nf_conn *ct,
 			       u32 seq);
 
 /* Fake conntrack entry for untracked connections */
+DECLARE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
 static inline struct nf_conn *nf_ct_untracked_get(void)
 {
-	extern struct nf_conn nf_conntrack_untracked;
-
-	return &nf_conntrack_untracked;
+	return &__raw_get_cpu_var(nf_conntrack_untracked);
 }
 extern void nf_ct_untracked_status_or(unsigned long bits);
 
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 6c1da21..9c66141 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -62,8 +62,8 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 unsigned int nf_conntrack_max __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_max);
 
-struct nf_conn nf_conntrack_untracked;
-EXPORT_SYMBOL_GPL(nf_conntrack_untracked);
+DEFINE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
+EXPORT_PER_CPU_SYMBOL(nf_conntrack_untracked);
 
 static int nf_conntrack_hash_rnd_initted;
 static unsigned int nf_conntrack_hash_rnd;
@@ -1183,10 +1183,21 @@ static void nf_ct_release_dying_list(struct net *net)
 	spin_unlock_bh(&nf_conntrack_lock);
 }
 
+static int untrack_refs(void)
+{
+	int cnt = 0, cpu;
+
+	for_each_possible_cpu(cpu) ...
From: Patrick McHardy
Date: Tuesday, June 8, 2010 - 7:29 am

That doesn't seem to be a valid assumption, the conntrack entry is
attached to the skb and processing in the output path might get
preempted and rescheduled to a different CPU.
--

From: Eric Dumazet
Date: Tuesday, June 8, 2010 - 7:52 am

Thats unfortunate.

Ok, only choice then is to not change refcount on the untracked ct, and
keep a shared (read only after setup time) untrack structure.





--

From: Eric Dumazet
Date: Tuesday, June 8, 2010 - 8:12 am

Oh well, re-reading my patch, I dont see why I said this in Changelog :)

We lazily select the untrack structure in one cpu, then keep the pointer
to this untrack structure, attached to ct.

The (still atomic) increment / decrement of refcount is done on the
saved pointer, not on actual per_cpu structure.

So if a packet is rescheduled on a different CPU, second cpu will "only"
dirty cache line of other cpu, it probably almost never happens...

Thanks

[PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking

NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet, slowing down performance.

This patch converts it to a per_cpu variable.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/netfilter/nf_conntrack.h |    5 +--
 net/netfilter/nf_conntrack_core.c    |   36 ++++++++++++++++++-------
 2 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 3bc38c7..84a4b6f 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -261,11 +261,10 @@ extern s16 (*nf_ct_nat_offset)(const struct nf_conn *ct,
 			       u32 seq);
 
 /* Fake conntrack entry for untracked connections */
+DECLARE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
 static inline struct nf_conn *nf_ct_untracked_get(void)
 {
-	extern struct nf_conn nf_conntrack_untracked;
-
-	return &nf_conntrack_untracked;
+	return &__raw_get_cpu_var(nf_conntrack_untracked);
 }
 extern void nf_ct_untracked_status_or(unsigned long bits);
 
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 6c1da21..9c66141 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -62,8 +62,8 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 unsigned int nf_conntrack_max __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_max);
 
-struct nf_conn ...
From: Patrick McHardy
Date: Wednesday, June 9, 2010 - 5:45 am

Applied, thanks Eric.
--

From: Patrick McHardy
Date: Tuesday, June 8, 2010 - 7:12 am

Applied, thanks Eric.
--

From: Patrick McHardy
Date: Friday, April 23, 2010 - 3:56 am

That sounds like a good idea. But lets what for Jesper's test results
before we start fixing this problem :)
--

From: Jesper Dangaard Brouer
Date: Friday, April 23, 2010 - 5:45 am

I will first have time to perform the tests Monday or Tuesday.

BUT I have just noticed there seems to be a corrolation between conntrack 
early_drop and searches.  I have upload a new graph:

  http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_early_drop002.png

I have not had time to checkout the code path yet...

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
--

From: Patrick McHardy
Date: Friday, April 23, 2010 - 6:57 am

I guess that's somewhat expected when your conntrack table is full
and all you're seeing is new connection setup attempts.

First you have a search for an existing conntrack, then it attempts
to create a new one and tries to early_drop and old one.
--

From: Jesper Dangaard Brouer
Date: Thursday, April 22, 2010 - 6:31 am

I have added a stats counter to prove my case, which I think we should add to the kernel (to detect the case in the future).
The DDoS attack has disappeared, so I guess I'll try to see if I can reproduce the problem in my testlab.



[PATCH] net: netfilter conntrack extended with extra stat counter.

From: Jesper Dangaard Brouer <hawk@comx.dk>

I suspect an unfortunatly series of events occuring under a DDoS
attack, in function __nf_conntrack_find() nf_contrack_core.c.

Adding a stats counter to see if the search is restarted too often.

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
---

 include/linux/netfilter/nf_conntrack_common.h      |    1 +
 .../netfilter/nf_conntrack_l3proto_ipv4_compat.c   |    7 ++++---
 net/netfilter/nf_conntrack_core.c                  |    4 +++-
 net/netfilter/nf_conntrack_standalone.c            |    7 ++++---
 4 files changed, 12 insertions(+), 7 deletions(-)


diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index a8248ee..57ff312 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -93,6 +93,7 @@ struct ip_conntrack_stat
 	unsigned int expect_new;
 	unsigned int expect_create;
 	unsigned int expect_delete;
+	unsigned int search_restart;
 };
 
 /* call to create an explicit dependency on nf_conntrack. */
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index 8668a3d..b94f510 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -336,12 +336,12 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 	const struct ip_conntrack_stat *st = v;
 
 	if (v == SEQ_START_TOKEN) {
-		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create ...
From: Patrick McHardy
Date: Friday, April 23, 2010 - 3:35 am

Applied, thanks Jesper.
--

Previous thread: [PATCH NEXT 0/8]qlcnic: inter driver coexistence fixes by Amit Kumar Salecha on Thursday, April 22, 2010 - 5:51 am. (8 messages)

Next thread: [PATCHv2] Socket filter ancilliary data access for skb->dev->type by Paul LeoNerd Evans on Thursday, April 22, 2010 - 6:32 am. (2 messages)