Re: Fw: [Bug 13339] New: rtable leak in ipv4/route.c

Previous thread: [PATCH][be2net] remove napi in the tx path by Ajit Khaparde on Monday, May 18, 2009 - 6:36 pm. (4 messages)

Next thread: [PATCH RFC] v6 expedited "big hammer" RCU grace periods by Paul E. McKenney on Monday, May 18, 2009 - 9:10 pm. (1 message)
From: Stephen Hemminger
Date: Monday, May 18, 2009 - 7:35 pm

Begin forwarded message:

Date: Mon, 18 May 2009 14:10:20 GMT
From: bugzilla-daemon@bugzilla.kernel.org
To: shemminger@linux-foundation.org
Subject: [Bug 13339] New: rtable leak in ipv4/route.c


http://bugzilla.kernel.org/show_bug.cgi?id=13339

           Summary: rtable leak in ipv4/route.c
           Product: Networking
           Version: 2.5
    Kernel Version: 2.6.29
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: IPV4
        AssignedTo: shemminger@linux-foundation.org
        ReportedBy: lav@yar.ru
        Regression: Yes


2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
patch has at least one critical flaw, and another problem.

rt_intern_hash calculates rthi pointer, which is later used for new entry
insertion. The same loop calculates cand pointer which is used to clean the
list. If the pointers are the same, rtable leak occurs, as first the cand is
removed then the new entry is appended to it.

This leak leads to unregister_netdevice problem (usage count > 0).

Another problem of the patch is that it tries to insert the entries in certain
order, to facilitate counting of entries distinct by all but QoS parameters.
Unfortunately, referencing an existing rtable entry moves it to list beginning,
to speed up further lookups, so the carefully built order is destroyed.

For the first problem the simplest patch it to set rthi=0 when rthi==cand, but
it will also destroy the ordering.

-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


-- 
--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 5:34 am

I think fixing this bug fast is more important than this
ordering or counting. Could you send your patch proposal?

Thanks,
Jarek P.

--

From: Neil Horman
Date: Tuesday, May 19, 2009 - 8:12 am

I agree that the corner case in which cand and rthi is a problem (although as
you point out easily fixable), and with an alteration to your proposal it can be
done without destroying the order.

As for the reordering by reference however, I don't see any move to front
heuristic being implemented anywhere in the routing path.  Where do you see this
happening?

Neil

--

From: Eric Dumazet
Date: Tuesday, May 19, 2009 - 8:32 am

We could change rt_check_expire() to be smarter and handle any order in chains.

This would let rt_intern_hash() be simpler.


Here is mine, only compiled, not tested yet.

All credits for Stephen for doing the analysis of course :)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c4c60e9..fbe77ad 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -780,12 +780,37 @@ static void rt_do_flush(int process_context)
 #define FRACT_BITS 3
 #define ONE (1UL << FRACT_BITS)
 
+static unsigned long compute_length(struct rtable *head)
+{
+	struct rtable *rth, *aux;
+	unsigned long length = 0;
+
+	for (rth = head; rth != NULL; rth = rth->u.dst.rt_next) {
+		/*
+		 * We ignore items having same hash inputs
+		 * so that entries for different QOS
+		 * levels, and other non-hash input
+		 * attributes don't unfairly skew
+		 * the length computation
+		 */
+		for (aux = head; ;aux = aux->u.dst.rt_next) {
+			if (aux == rth) {
+				length += ONE;
+				break;
+			}
+			if (compare_hash_inputs(&aux->fl, &rth->fl))
+				break;
+		}
+	}
+	return length;
+}
+
 static void rt_check_expire(void)
 {
 	static unsigned int rover;
 	unsigned int i = rover, goal;
 	struct rtable *rth, **rthp;
-	unsigned long length = 0, samples = 0;
+	unsigned long length, samples = 0;
 	unsigned long sum = 0, sum2 = 0;
 	u64 mult;
 
@@ -795,7 +820,6 @@ static void rt_check_expire(void)
 	goal = (unsigned int)mult;
 	if (goal > rt_hash_mask)
 		goal = rt_hash_mask + 1;
-	length = 0;
 	for (; goal > 0; goal--) {
 		unsigned long tmo = ip_rt_gc_timeout;
 
@@ -821,29 +845,11 @@ static void rt_check_expire(void)
 				if (time_before_eq(jiffies, rth->u.dst.expires)) {
 					tmo >>= 1;
 					rthp = &rth->u.dst.rt_next;
-					/*
-					 * Only bump our length if the hash
-					 * inputs on entries n and n+1 are not
-					 * the same, we only count entries on
-					 * a chain with equal hash inputs once
-					 * so that entries for different QOS
-					 * levels, and other ...
From: Neil Horman
Date: Tuesday, May 19, 2009 - 9:20 am

I was thinking something along these lines (also completely untested).  The
extra check in rt_intern hash prevents us from identifying a 'group leader' that
is about to be deleted, and sets rthi to point at the group leader rather than
the last entry in the group as it previously did, which we then mark with the
new flag in the dst_entry (updating it on rt_free of course).  This lets us
identify the boundaries of a group, so when we do a move to front heruistic, we
can move the entire group rather than just the single entry.    I prefer this to
Erics approach since, even though rt_check_expire isn't in a performance
critical path, the addition of the extra list search when the list is unordered
makes rt_check_expire run in O(n^2) rather than O(n), which I think will have a
significant impact on high traffic systems.  Since the ordered list creation was
done at almost zero performance cost in rt_intern_hash, I'd like to keep it if
at all possible.

There is of course a shortcomming in my patch in that it doesn't actually do
anything currently with DST_GRPLDR other than set/clear it appropriately, I've
yet to identify where the route cache move to front heruistic is implemented.
If someone could show me where that is, I'd be happy to finish this patch.

Neil

--

From: Eric Dumazet
Date: Tuesday, May 19, 2009 - 11:47 am

O(n^2) ? Not exactly ...

N * (Y + x)  instead of N * (Y),  where x <= Y/100, and we can make x being 0
in fact with proper prefetching.

Real cost is bringing into CPU cache the information needed, by two
orders of magnitude. (one cache line miss : more than 200 cycles,
and we need at least two cache lines per rtable.)

Adding 'group' notion to dst entries is overengineering an already
too complex ipv4/route.c file, that is almost unreadable these days,
since even you can not spot the point where we move the entry at chain
front.

Hint : search for /* Put it first */ in rt_intern_hash()

Moving whole group in front would defeat the purpose of move, actually,
since rank in chain is used to decay the timeout in garbage collector.
(search for tmo >>= 1; )


Here is is a revised patch, doing the length computation inside
the loop, while cpu is prefeching next entry into its cache.

BTW, we actually have another bug, since length is set to 0 at the wrong place
(before the for (; goal > 0; goal--) loop)

We must of course reset length to 0 for each chain.

All credits for Alexander V. Lukyanov for doing the analysis of course :)


 net/ipv4/route.c |   57 ++++++++++++++-------------------------------
 1 files changed, 18 insertions(+), 39 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c4c60e9..769f72e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -784,7 +784,7 @@ static void rt_check_expire(void)
 {
 	static unsigned int rover;
 	unsigned int i = rover, goal;
-	struct rtable *rth, **rthp;
+	struct rtable *rth, *aux, **rthp;
 	unsigned long length = 0, samples = 0;
 	unsigned long sum = 0, sum2 = 0;
 	u64 mult;
@@ -795,7 +795,6 @@ static void rt_check_expire(void)
 	goal = (unsigned int)mult;
 	if (goal > rt_hash_mask)
 		goal = rt_hash_mask + 1;
-	length = 0;
 	for (; goal > 0; goal--) {
 		unsigned long tmo = ip_rt_gc_timeout;
 
@@ -809,8 +808,10 @@ static void rt_check_expire(void)
 
 		if (*rthp == NULL)
 ...
From: Neil Horman
Date: Tuesday, May 19, 2009 - 12:24 pm

you're right, on closer examination, its not O(n^2), but its not what you have
From the way I read you patch (which is actually pretty clever as I look at it),
you add an extra loop that searches from the start of the given chain to the
point at which the outer loop is set looking for a duplicate, so that each
rtable entry which varies only by non-hash relevant values is counted only once.
Thats good.  But that means for a chain of n nodes in which you are checking the
ith node for duplicates you need to visit at worst i other nodes, so to process
the entire chain we need to visit SUM(i=1..n)(i) == (n(n-1)/2) nodes rather than
the simply linear search requireing the inspection of only n nodes.  I see how
with prefetching the cost of the additional node visitation can be minimized,
Ah! So its not actually on rtable access (i.e. route lookup) that we do a
move-to-front, its only when we're hashing in new entries.  That explains why I
didn't see it, I was thinking route lookups moved the entry to the front, not
Argh, so the list is implicitly ordered by expiration time.  That really defeats
the entire purpose of doing grouping in the ilst at all.  If thats the case,
then I agree, its probably better to to take the additional visitation hit in in
check_expire above than to try and preserve ordering. 

Thanks
Neil

--

From: David Miller
Date: Tuesday, May 19, 2009 - 3:05 pm

From: Neil Horman <nhorman@tuxdriver.com>

Yes, this seems best.

I was worried that somehow the ordering also influences lookups,
because the TOS bits don't go into the hash so I worried that it would
be important that explicit TOS values appear before wildcard ones.
But it doesn't appear that this is an issue, we don't have wildcard
TOSs in the rtable entries, they are always explicit.

So I would like to see an explicit final patch from Eric so we can get
this fixed now.

Thanks!
--

From: Neil Horman
Date: Tuesday, May 19, 2009 - 4:05 pm

Agreed, thanks once again for educating me Eric!
Neil

--

From: Eric Dumazet
Date: Tuesday, May 19, 2009 - 9:54 pm

I would like to split patches because we have two bugs indeed, and
I prefer to get attention for both problems, I dont remember Neil acknowledged
the length computation problem.

First and small patch, candidate for net-2.6 and stable (for 2.6.29) :

Thank you

[PATCH] net: fix length computation in rt_check_expire()

rt_check_expire() computes average and standard deviation of chain lengths,
but not correclty reset length to 0 at beginning of each chain.
This probably gives overflows for sum2 (and sum) on loaded machines instead 
of meaningful results.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 net/ipv4/route.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c4c60e9..869cf1c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -785,7 +785,7 @@ static void rt_check_expire(void)
 	static unsigned int rover;
 	unsigned int i = rover, goal;
 	struct rtable *rth, **rthp;
-	unsigned long length = 0, samples = 0;
+	unsigned long samples = 0;
 	unsigned long sum = 0, sum2 = 0;
 	u64 mult;
 
@@ -795,9 +795,9 @@ static void rt_check_expire(void)
 	goal = (unsigned int)mult;
 	if (goal > rt_hash_mask)
 		goal = rt_hash_mask + 1;
-	length = 0;
 	for (; goal > 0; goal--) {
 		unsigned long tmo = ip_rt_gc_timeout;
+		unsigned long length;
 
 		i = (i + 1) & rt_hash_mask;
 		rthp = &rt_hash_table[i].chain;
@@ -809,6 +809,7 @@ static void rt_check_expire(void)
 
 		if (*rthp == NULL)
 			continue;
+		length = 0;
 		spin_lock_bh(rt_hash_lock_addr(i));
 		while ((rth = *rthp) != NULL) {
 			if (rt_is_expired(rth)) {

--

From: David Miller
Date: Tuesday, May 19, 2009 - 11:13 pm

From: Eric Dumazet <dada1@cosmosbay.com>

Yes I definitely agree this length initialization is a bug.

I'll integrate this fix soon, thanks Eric.  You can send me the
other part of the bug fix relative to this if you like.
--

From: Eric Dumazet
Date: Tuesday, May 19, 2009 - 11:14 pm

Then here is the patch on top on previous one.

Thanks to all

[PATCH] net: fix rtable leak in net/ipv4/route.c

Alexander V. Lukyanov found a regression in 2.6.29 and made a complete
analysis found in http://bugzilla.kernel.org/show_bug.cgi?id=13339
Quoted here because its a perfect one :

begin_of_quotation 
 2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
 patch has at least one critical flaw, and another problem.

 rt_intern_hash calculates rthi pointer, which is later used for new entry
 insertion. The same loop calculates cand pointer which is used to clean the
 list. If the pointers are the same, rtable leak occurs, as first the cand is
 removed then the new entry is appended to it.

 This leak leads to unregister_netdevice problem (usage count > 0).

 Another problem of the patch is that it tries to insert the entries in certain
 order, to facilitate counting of entries distinct by all but QoS parameters.
 Unfortunately, referencing an existing rtable entry moves it to list beginning,
 to speed up further lookups, so the carefully built order is destroyed.

 For the first problem the simplest patch it to set rthi=0 when rthi==cand, but
 it will also destroy the ordering.
end_of_quotation


Problematic commit is 1080d709fb9d8cd4392f93476ee46a9d6ea05a5b
(net: implement emergency route cache rebulds when gc_elasticity is exceeded)

Trying to keep dst_entries ordered is too complex and breaks the fact that
order should depend on the frequency of use for garbage collection.

A possible fix is to make rt_intern_hash() simpler, and only makes
rt_check_expire() a litle bit smarter, being able to cope with an arbitrary
entries order. The added loop is running on cache hot data, while cpu
is prefetching next object, so should be unnoticied.

Reported-and-analyzed-by: Alexander V. Lukyanov <lav@yar.ru>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 net/ipv4/route.c |   55 +++++++++++++--------------------------------
 1 files ...
From: Jarek Poplawski
Date: Wednesday, May 20, 2009 - 3:03 am

Very "interesting" for() usage, but isn't it more readable like this?:

					aux = rt_hash_table[i].chain;
					while (aux != rth) {
						if (compare_hash_inputs(&aux->fl, &rth->fl))
							break;
						aux = aux->u.dst.rt_next;
					}

					if (aux == rth)
						length += ONE;

--

From: Eric Dumazet
Date: Wednesday, May 20, 2009 - 4:13 am

I first wrote :

					for (aux = rt_hash_table[i].chain ; ; aux = aux->u.dst.rt_next) {
						if (aux == rth) {
							length += ONE;
							break;
						}
						if (compare_hash_inputs(&aux->fl, &rth->fl))
							break;
					}

but had to split the too long line, so ended in the form in the patch :)


Thank you


--

From: Jarek Poplawski
Date: Wednesday, May 20, 2009 - 4:37 am

I know, but I guess this is used quite often. And probably it's not
very hard optimization for a compiler (while - else). As a matter of
fact even this would confuse me less here:
					aux = rt_hash_table[i].chain;
					for (;;) {

But of course, it's a matter of taste, so no big deal.

Jarek P.
--

From: Neil Horman
Date: Wednesday, May 20, 2009 - 3:48 am

Acked-by: Neil Horman <nhorman@tuxdriver.com>

--

From: David Miller
Date: Wednesday, May 20, 2009 - 5:19 pm

From: Neil Horman <nhorman@tuxdriver.com>

Also applied and queued up for -stable, thanks!
--

From: Neil Horman
Date: Wednesday, May 20, 2009 - 3:27 am

From: David Miller
Date: Wednesday, May 20, 2009 - 5:19 pm

From: Neil Horman <nhorman@tuxdriver.com>

Applied and queued up for -stable.
--

From: Neil Horman
Date: Tuesday, May 19, 2009 - 9:23 am

Of course, it helps if I attach the patch :)


diff --git a/include/net/dst.h b/include/net/dst.h
index 6be3b08..a39db6d 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -47,6 +47,7 @@ struct dst_entry
 #define DST_NOXFRM		2
 #define DST_NOPOLICY		4
 #define DST_NOHASH		8
+#define DST_GRPLDR		16
 	unsigned long		expires;
 
 	unsigned short		header_len;	/* more space at head required */
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c4c60e9..0120f0e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -610,6 +610,8 @@ static inline int ip_rt_proc_init(void)
 
 static inline void rt_free(struct rtable *rt)
 {
+	if (rt->u.dst.flags & DST_GRPLDR)
+		rt->u.dst.rt_next->u.dst.flag |= DST_GRPLDR;
 	call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
 }
 
@@ -1143,8 +1145,11 @@ restart:
 		 * relvant to the hash function together, which we use to adjust
 		 * our chain length
 		 */
-		if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
+		if (!*rthi && *rthp && 
+		    compare_hash_inputs(&(*rthp)->fl, &rt->fl) &&
+		    (cand != rth))
 			rthi = rth;
+		}
 	}
 
 	if (cand) {
@@ -1205,11 +1210,14 @@ restart:
 		}
 	}
 
-	if (rthi)
+	if (rthi) {
 		rt->u.dst.rt_next = rthi->u.dst.rt_next;
-	else
+		rthi->u.dst.flags &= ~(DST_GRPLDR);
+	} else
 		rt->u.dst.rt_next = rt_hash_table[hash].chain;
 
+	rt->u.dst.flags |= DST_GRPLDR;
+
 #if RT_CACHE_DEBUG >= 2
 	if (rt->u.dst.rt_next) {
 		struct rtable *trt;
--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 10:17 am

Does it really prevent cand == rthi in the next loop?

I didn't check Eric's patch yet, but I really don't know what's wrong
with something as simple as below for -stable, until "proper" fix is
analyzed and tested.

Jarek P.
---

 net/ipv4/route.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c4c60e9..f4e6c7a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1157,6 +1157,8 @@ restart:
 		if (chain_length > ip_rt_gc_elasticity) {
 			*candp = cand->u.dst.rt_next;
 			rt_free(cand);
+			if (rthi == cand)
+				rthi = NULL;
 		}
 	} else {
 		if (chain_length > rt_chain_length_max) {
--

From: Neil Horman
Date: Tuesday, May 19, 2009 - 10:45 am

Yes, because cand and rthi are inspected during the same loop iteration, and
both assigned from rth.  since I added a check which requires !rthi (which is
actually a bug above that I need to fix), once rthi is set, it won't be moved,
and on the next iteration, if cand is assigned, it is assigned to rth, which
Because the above fixes it without continuing to break the ordering.  You're
change below prevents the leak, but still allows for disordered lists to form,
which IMHO doesn't really make it a candidate for -stable.  I'd much rather fix
both the leak and the ordering before pushing anything out

speaking of which, I'm going to ask again, I've been looking all morning, and
I'm unable to find the move to front heuristic that you mentioned creates furthe
list disordering.  If you can point it out to me, I can complete my patch and
present something for you to test more throughly.

--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 10:53 am

As I've written already, your patch looks OK to me.

Thanks,
Jarek P.
--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 11:05 am

On Tue, May 19, 2009 at 01:45:04PM -0400, Neil Horman wrote:

I'd only like to make it clear the analysis you refer to is by
Alexander V. Lukyanov <lav@yar.ru>

Jarek P.
--

From: Neil Horman
Date: Tuesday, May 19, 2009 - 11:16 am

Ok, thank you for giving credit where its due :).  Alexander, can you point out
the location of the move-to-front heruistic that you mention in your analysis?

--

From: Alexander V. Lukyanov
Date: Tuesday, May 19, 2009 - 11:36 pm

It's near /* Put it first */ comment (if I read the code correctly).

-- 
   Alexander..
--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 10:47 am

Hmm.. It looks OK yet!

Sorry,
Jarek P.
--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 10:22 am

- All credits for Stephen for doing the analysis of course :)
+ All credits for Alexander V. Lukyanov for doing the analysis of course :)

Jarek P.
--

Previous thread: [PATCH][be2net] remove napi in the tx path by Ajit Khaparde on Monday, May 18, 2009 - 6:36 pm. (4 messages)

Next thread: [PATCH RFC] v6 expedited "big hammer" RCU grace periods by Paul E. McKenney on Monday, May 18, 2009 - 9:10 pm. (1 message)