Re: [PATCH 0/20] Batch network namespace cleanup

Previous thread: [patch][ipsec] can not add camellia cipher algorithm when using "ip xfrm state" command by Li Yewang on Sunday, November 29, 2009 - 6:39 pm. (2 messages)

Next thread: [PATCH 01/20] net: NETDEV_UNREGISTER_PERNET -> NETDEV_UNREGISTER_BATCH by Eric W. Biederman on Sunday, November 29, 2009 - 6:45 pm. (21 messages)
From: Eric W. Biederman
Date: Sunday, November 29, 2009 - 6:46 pm

Recently Jamal and Daniel perform some experiments and found that
large numbers of network namespace exiting simultaneously is very
inefficient.  24+ minutes in some configurations.  The cpu overhead
was negligible but it results in long hold times of net_mutex, and
memory being consumed a long time after the last user has gone away.

I looked into it and discovered that by batching network namespace
cleanups I can reduce the time for 4k network namespaces exiting from
5-7 minutes in my configuration to 44 seconds.

This patch series is my set of changes to the network namespace core
and associated cleanups to allow for network namespace batching.

Eric


 drivers/net/bonding/bond_main.c         |   32 +---
 drivers/net/loopback.c                  |    8 -
 drivers/net/ppp_generic.c               |   30 +---
 drivers/net/pppoe.c                     |   38 +----
 drivers/net/pppol2tp.c                  |   36 +----
 include/linux/netdevice.h               |    2 +
 include/linux/notifier.h                |    2 +-
 include/net/net_namespace.h             |    8 +-
 include/net/netns/generic.h             |    8 +-
 include/net/route.h                     |    1 +
 net/8021q/vlan.c                        |   33 +---
 net/core/dev.c                          |   52 ++-----
 net/core/net_namespace.c                |  254 +++++++++++++++++++------------
 net/ipv4/fib_frontend.c                 |    4 +-
 net/ipv4/ip_gre.c                       |   24 +---
 net/ipv4/ipip.c                         |   24 +---
 net/ipv4/route.c                        |    6 +
 net/ipv6/ip6_tunnel.c                   |   25 +---
 net/ipv6/sit.c                          |   25 +---
 net/key/af_key.c                        |   25 +---
 net/netfilter/nf_conntrack_proto_dccp.c |   31 +---
 net/netfilter/nf_conntrack_proto_gre.c  |   20 +--
 net/phonet/pn_dev.c                     |   16 +--
 23 files changed, 284 insertions(+), 420 deletions(-)

Eric W. Biederman (20):
      net: ...
From: Eric Dumazet
Date: Monday, November 30, 2009 - 1:07 am

It seems list received part of your patches Eric

01, 07-20 are OK,   02-06 are missing

Thanks
--

From: David Miller
Date: Monday, November 30, 2009 - 1:09 am

From: Eric Dumazet <eric.dumazet@gmail.com>

Not just the list, I didn't get copies of the missing patches through
direct email even though I was on the CC: list.
--

From: Eric W. Biederman
Date: Monday, November 30, 2009 - 1:17 am

Thanks.  Weird.  I will take a look and resend.

Eric

--

From: Eric W. Biederman
Date: Monday, November 30, 2009 - 1:48 am

I have resent the missing patches, I'm not certain what happened
the first time but clearly those didn't get sent.

02 and 06 have made it to netdev and looped back to me.  I'm not certain
where 03-05 are, but they should be in transit.

Eric
--

From: Eric W. Biederman
Date: Monday, November 30, 2009 - 1:25 am

From: Eric W. Biederman <ebiederm@xmission.com>

I will need this shortly to implement network namespace shutdown
batching.  For sanity sake network devices should be removed in
the reverse order they were created in.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/netdevice.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9428793..daf13d3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1112,6 +1112,8 @@ extern rwlock_t				dev_base_lock;		/* Device list lock */
 
 #define for_each_netdev(net, d)		\
 		list_for_each_entry(d, &(net)->dev_base_head, dev_list)
+#define for_each_netdev_reverse(net, d)	\
+		list_for_each_entry_reverse(d, &(net)->dev_base_head, dev_list)
 #define for_each_netdev_rcu(net, d)		\
 		list_for_each_entry_rcu(d, &(net)->dev_base_head, dev_list)
 #define for_each_netdev_safe(net, d, n)	\
-- 
1.6.5.2.143.g8cc62

--

From: Eric W. Biederman
Date: Monday, November 30, 2009 - 1:25 am

From: Eric W. Biederman <ebiederm@xmission.com>

Defer calling unregister_netdevice_queue to cleanup_net.  It's simpler
and it allows the loopback device to land in the same batch as other
network devices.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/net/loopback.c |    8 --------
 1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index c9f6557..eae4ad7 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -212,15 +212,7 @@ out:
 	return err;
 }
 
-static __net_exit void loopback_net_exit(struct net *net)
-{
-	struct net_device *dev = net->loopback_dev;
-
-	unregister_netdev(dev);
-}
-
 /* Registered in net/core/dev.c */
 struct pernet_operations __net_initdata loopback_net_ops = {
        .init = loopback_net_init,
-       .exit = loopback_net_exit,
 };
-- 
1.6.5.2.143.g8cc62

--

From: Eric W. Biederman
Date: Monday, November 30, 2009 - 1:25 am

From: Eric W. Biederman <eric@conroxe.ebiederm.org>

It is fairly common to kill several network namespaces at once.  Either
because they are nested one inside the other or because they are cooperating
in multiple machine networking experiments.  As the network stack control logic
does not parallelize easily batch up multiple network namespaces existing
together.

To get the full benefit of batching the virtual network devices to be
removed must be all removed in one batch.  For that purpose I have added
a loop after the last network device operations have run that batches
up all remaining network devices and deletes them.

An extra benefit is that the reorganization slightly shrinks the size
of the per network namespace data structures replaceing a work_struct
with a list_head.

In a trivial test with 4K namespaces this change reduced the cost of
a destroying 4K namespaces from 7+ minutes (at 12% cpu) to 44 seconds
(at 60% cpu).  The bulk of that 44s was spent in inet_twsk_purge.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 include/net/net_namespace.h |    2 +-
 net/core/net_namespace.c    |   66 +++++++++++++++++++++++++++++++++++++-----
 2 files changed, 59 insertions(+), 9 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 0addd45..d69b479 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -42,7 +42,7 @@ struct net {
 						 */
 #endif
 	struct list_head	list;		/* list of network namespaces */
-	struct work_struct	work;		/* work struct for freeing */
+	struct list_head	cleanup_list;	/* namespaces on death row */
 
 	struct proc_dir_entry 	*proc_net;
 	struct proc_dir_entry 	*proc_net_stat;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 86ed7f4..a42caa2 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -8,8 +8,10 @@
 #include <linux/idr.h>
 #include <linux/rculist.h>
 #include <linux/nsproxy.h>
+#include <linux/netdevice.h>
 ...
From: Eric W. Biederman
Date: Monday, November 30, 2009 - 1:25 am

From: Eric W. Biederman <ebiederm@xmission.com>

- Defer dellink to net_cleanup() allowing for batching.
- Fix comment.
- Use for_each_netdev_safe again as dev_change_net_namespace touches
  at most one network device (unlike veth dellink).

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 net/core/dev.c |   16 ++++++----------
 1 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 124029e..c3e3400 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5735,14 +5735,13 @@ static struct pernet_operations __net_initdata netdev_net_ops = {
 
 static void __net_exit default_device_exit(struct net *net)
 {
-	struct net_device *dev;
+	struct net_device *dev, *aux;
 	/*
-	 * Push all migratable of the network devices back to the
+	 * Push all migratable network devices back to the
 	 * initial network namespace
 	 */
 	rtnl_lock();
-restart:
-	for_each_netdev(net, dev) {
+	for_each_netdev_safe(net, dev, aux) {
 		int err;
 		char fb_name[IFNAMSIZ];
 
@@ -5750,11 +5749,9 @@ restart:
 		if (dev->features & NETIF_F_NETNS_LOCAL)
 			continue;
 
-		/* Delete virtual devices */
-		if (dev->rtnl_link_ops && dev->rtnl_link_ops->dellink) {
-			dev->rtnl_link_ops->dellink(dev, NULL);
-			goto restart;
-		}
+		/* Leave virtual devices for the generic cleanup */
+		if (dev->rtnl_link_ops)
+			continue;
 
 		/* Push remaing network devices to init_net */
 		snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
@@ -5764,7 +5761,6 @@ restart:
 				__func__, dev->name, err);
 			BUG();
 		}
-		goto restart;
 	}
 	rtnl_unlock();
 }
-- 
1.6.5.2.143.g8cc62

--

From: Eric W. Biederman
Date: Monday, November 30, 2009 - 1:25 am

From: Eric W. Biederman <ebiederm@xmission.com>

To get the full benefit of batched network namespace cleanup netowrk
device deletion needs to be performed by the generic code.  When
using register_pernet_gen_device and freeing the data in exit_net
it is impossible to delay allocation until after exit_net has called
as the device uninit methods are no longer safe.

To correct this, and to simplify working with per network namespace data
I have moved allocation and deletion of per network namespace data into
the network namespace core.  The core now frees the data only after
all of the network namespace exit routines have run.

Now it is only required to set the new fields .id and .size
in the pernet_operations structure if you want network namespace
data to be managed for you automatically.

This makes the current register_pernet_gen_device and
register_pernet_gen_subsys routines unnecessary.  For the moment
I have left them as compatibility wrappers in net_namespace.h
They will be removed once all of the users have been updated.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 include/net/net_namespace.h |   28 ++++++-
 net/core/net_namespace.c    |  188 +++++++++++++++++++++++--------------------
 2 files changed, 126 insertions(+), 90 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index d69b479..080774b 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -236,6 +236,8 @@ struct pernet_operations {
 	struct list_head list;
 	int (*init)(struct net *net);
 	void (*exit)(struct net *net);
+	int *id;
+	size_t size;
 };
 
 /*
@@ -259,12 +261,30 @@ struct pernet_operations {
  */
 extern int register_pernet_subsys(struct pernet_operations *);
 extern void unregister_pernet_subsys(struct pernet_operations *);
-extern int register_pernet_gen_subsys(int *id, struct pernet_operations *);
-extern void unregister_pernet_gen_subsys(int id, struct pernet_operations *);
 extern int ...
From: jamal
Date: Monday, November 30, 2009 - 5:44 am

Excellent work Eric. I have time today so i can test these patches
status quo vs.
This is against net-next?

cheers,
jamal



--

From: Eric W. Biederman
Date: Monday, November 30, 2009 - 12:22 pm

Yes.

Eric
--

From: David Miller
Date: Monday, November 30, 2009 - 5:34 pm

From: ebiederm@xmission.com (Eric W. Biederman)

All applied, and assuming all of the build checks pass I'll
push this out to net-next-2.6

I should look into that inet_twsk_purge performance issue you mention
when tearing down a namespace.  It walks the entire hash table and
takes a lock for every hash chain.

Eric, is it possible for us to at least slightly optimize this by
doing a peek at whether the head pointer of each chain is NULL and
bypass the spinlock and everything else in that case?  Or is this
not legal with sk_nulls?

Something like:

	if (hlist_nulls_empty(&head->twchain))
		continue;

right before the 'restart' label?
--

From: Eric W. Biederman
Date: Monday, November 30, 2009 - 5:55 pm

I haven't had a chance to wrap my head around that case fully yet.

After playing with a few ideas I think what we want to do algorithmically
is to have a batched flush like we do for the routing table cache.  That
should get the cost down to only about 100ms.  Which is much better when
you have a lot of them but is still a lot of time.

From my preliminary investigation I believe we don't need to take any
locks to traverse the hash table.  I think we can do the entire hash
table traversal under simple rcu protection, and only take the lock on
the individual time wait entries to delete them.

....

Also for best batching we have ipip, ipgre, ip6_tunnel, sit, vlan,
and bridging that need to be taught to use rtnl_link_ops and let
the generic code delete their devices.  

The changes for the vlan code are simple.  The rest I haven't finished
wrapping my head around the drivers individual requirements for.   Still
the changes should not be sweeping.

Eric
--

Previous thread: [patch][ipsec] can not add camellia cipher algorithm when using "ip xfrm state" command by Li Yewang on Sunday, November 29, 2009 - 6:39 pm. (2 messages)

Next thread: [PATCH 01/20] net: NETDEV_UNREGISTER_PERNET -> NETDEV_UNREGISTER_BATCH by Eric W. Biederman on Sunday, November 29, 2009 - 6:45 pm. (21 messages)