Recently Jamal and Daniel perform some experiments and found that
large numbers of network namespace exiting simultaneously is very
inefficient. 24+ minutes in some configurations. The cpu overhead
was negligible but it results in long hold times of net_mutex, and
memory being consumed a long time after the last user has gone away.
I looked into it and discovered that by batching network namespace
cleanups I can reduce the time for 4k network namespaces exiting from
5-7 minutes in my configuration to 44 seconds.
This patch series is my set of changes to the network namespace core
and associated cleanups to allow for network namespace batching.
Eric
drivers/net/bonding/bond_main.c | 32 +---
drivers/net/loopback.c | 8 -
drivers/net/ppp_generic.c | 30 +---
drivers/net/pppoe.c | 38 +----
drivers/net/pppol2tp.c | 36 +----
include/linux/netdevice.h | 2 +
include/linux/notifier.h | 2 +-
include/net/net_namespace.h | 8 +-
include/net/netns/generic.h | 8 +-
include/net/route.h | 1 +
net/8021q/vlan.c | 33 +---
net/core/dev.c | 52 ++-----
net/core/net_namespace.c | 254 +++++++++++++++++++------------
net/ipv4/fib_frontend.c | 4 +-
net/ipv4/ip_gre.c | 24 +---
net/ipv4/ipip.c | 24 +---
net/ipv4/route.c | 6 +
net/ipv6/ip6_tunnel.c | 25 +---
net/ipv6/sit.c | 25 +---
net/key/af_key.c | 25 +---
net/netfilter/nf_conntrack_proto_dccp.c | 31 +---
net/netfilter/nf_conntrack_proto_gre.c | 20 +--
net/phonet/pn_dev.c | 16 +--
23 files changed, 284 insertions(+), 420 deletions(-)
Eric W. Biederman (20):
net: ...It seems list received part of your patches Eric 01, 07-20 are OK, 02-06 are missing Thanks --
From: Eric Dumazet <eric.dumazet@gmail.com> Not just the list, I didn't get copies of the missing patches through direct email even though I was on the CC: list. --
Thanks. Weird. I will take a look and resend. Eric --
I have resent the missing patches, I'm not certain what happened the first time but clearly those didn't get sent. 02 and 06 have made it to netdev and looped back to me. I'm not certain where 03-05 are, but they should be in transit. Eric --
From: Eric W. Biederman <ebiederm@xmission.com> I will need this shortly to implement network namespace shutdown batching. For sanity sake network devices should be removed in the reverse order they were created in. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> --- include/linux/netdevice.h | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 9428793..daf13d3 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1112,6 +1112,8 @@ extern rwlock_t dev_base_lock; /* Device list lock */ #define for_each_netdev(net, d) \ list_for_each_entry(d, &(net)->dev_base_head, dev_list) +#define for_each_netdev_reverse(net, d) \ + list_for_each_entry_reverse(d, &(net)->dev_base_head, dev_list) #define for_each_netdev_rcu(net, d) \ list_for_each_entry_rcu(d, &(net)->dev_base_head, dev_list) #define for_each_netdev_safe(net, d, n) \ -- 1.6.5.2.143.g8cc62 --
From: Eric W. Biederman <ebiederm@xmission.com>
Defer calling unregister_netdevice_queue to cleanup_net. It's simpler
and it allows the loopback device to land in the same batch as other
network devices.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
drivers/net/loopback.c | 8 --------
1 files changed, 0 insertions(+), 8 deletions(-)
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index c9f6557..eae4ad7 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -212,15 +212,7 @@ out:
return err;
}
-static __net_exit void loopback_net_exit(struct net *net)
-{
- struct net_device *dev = net->loopback_dev;
-
- unregister_netdev(dev);
-}
-
/* Registered in net/core/dev.c */
struct pernet_operations __net_initdata loopback_net_ops = {
.init = loopback_net_init,
- .exit = loopback_net_exit,
};
--
1.6.5.2.143.g8cc62
--
From: Eric W. Biederman <eric@conroxe.ebiederm.org>
It is fairly common to kill several network namespaces at once. Either
because they are nested one inside the other or because they are cooperating
in multiple machine networking experiments. As the network stack control logic
does not parallelize easily batch up multiple network namespaces existing
together.
To get the full benefit of batching the virtual network devices to be
removed must be all removed in one batch. For that purpose I have added
a loop after the last network device operations have run that batches
up all remaining network devices and deletes them.
An extra benefit is that the reorganization slightly shrinks the size
of the per network namespace data structures replaceing a work_struct
with a list_head.
In a trivial test with 4K namespaces this change reduced the cost of
a destroying 4K namespaces from 7+ minutes (at 12% cpu) to 44 seconds
(at 60% cpu). The bulk of that 44s was spent in inet_twsk_purge.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
include/net/net_namespace.h | 2 +-
net/core/net_namespace.c | 66 +++++++++++++++++++++++++++++++++++++-----
2 files changed, 59 insertions(+), 9 deletions(-)
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 0addd45..d69b479 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -42,7 +42,7 @@ struct net {
*/
#endif
struct list_head list; /* list of network namespaces */
- struct work_struct work; /* work struct for freeing */
+ struct list_head cleanup_list; /* namespaces on death row */
struct proc_dir_entry *proc_net;
struct proc_dir_entry *proc_net_stat;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 86ed7f4..a42caa2 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -8,8 +8,10 @@
#include <linux/idr.h>
#include <linux/rculist.h>
#include <linux/nsproxy.h>
+#include <linux/netdevice.h>
...From: Eric W. Biederman <ebiederm@xmission.com>
- Defer dellink to net_cleanup() allowing for batching.
- Fix comment.
- Use for_each_netdev_safe again as dev_change_net_namespace touches
at most one network device (unlike veth dellink).
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
net/core/dev.c | 16 ++++++----------
1 files changed, 6 insertions(+), 10 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 124029e..c3e3400 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5735,14 +5735,13 @@ static struct pernet_operations __net_initdata netdev_net_ops = {
static void __net_exit default_device_exit(struct net *net)
{
- struct net_device *dev;
+ struct net_device *dev, *aux;
/*
- * Push all migratable of the network devices back to the
+ * Push all migratable network devices back to the
* initial network namespace
*/
rtnl_lock();
-restart:
- for_each_netdev(net, dev) {
+ for_each_netdev_safe(net, dev, aux) {
int err;
char fb_name[IFNAMSIZ];
@@ -5750,11 +5749,9 @@ restart:
if (dev->features & NETIF_F_NETNS_LOCAL)
continue;
- /* Delete virtual devices */
- if (dev->rtnl_link_ops && dev->rtnl_link_ops->dellink) {
- dev->rtnl_link_ops->dellink(dev, NULL);
- goto restart;
- }
+ /* Leave virtual devices for the generic cleanup */
+ if (dev->rtnl_link_ops)
+ continue;
/* Push remaing network devices to init_net */
snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
@@ -5764,7 +5761,6 @@ restart:
__func__, dev->name, err);
BUG();
}
- goto restart;
}
rtnl_unlock();
}
--
1.6.5.2.143.g8cc62
--
From: Eric W. Biederman <ebiederm@xmission.com>
To get the full benefit of batched network namespace cleanup netowrk
device deletion needs to be performed by the generic code. When
using register_pernet_gen_device and freeing the data in exit_net
it is impossible to delay allocation until after exit_net has called
as the device uninit methods are no longer safe.
To correct this, and to simplify working with per network namespace data
I have moved allocation and deletion of per network namespace data into
the network namespace core. The core now frees the data only after
all of the network namespace exit routines have run.
Now it is only required to set the new fields .id and .size
in the pernet_operations structure if you want network namespace
data to be managed for you automatically.
This makes the current register_pernet_gen_device and
register_pernet_gen_subsys routines unnecessary. For the moment
I have left them as compatibility wrappers in net_namespace.h
They will be removed once all of the users have been updated.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
include/net/net_namespace.h | 28 ++++++-
net/core/net_namespace.c | 188 +++++++++++++++++++++++--------------------
2 files changed, 126 insertions(+), 90 deletions(-)
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index d69b479..080774b 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -236,6 +236,8 @@ struct pernet_operations {
struct list_head list;
int (*init)(struct net *net);
void (*exit)(struct net *net);
+ int *id;
+ size_t size;
};
/*
@@ -259,12 +261,30 @@ struct pernet_operations {
*/
extern int register_pernet_subsys(struct pernet_operations *);
extern void unregister_pernet_subsys(struct pernet_operations *);
-extern int register_pernet_gen_subsys(int *id, struct pernet_operations *);
-extern void unregister_pernet_gen_subsys(int id, struct pernet_operations *);
extern int ...Excellent work Eric. I have time today so i can test these patches status quo vs. This is against net-next? cheers, jamal --
From: ebiederm@xmission.com (Eric W. Biederman) All applied, and assuming all of the build checks pass I'll push this out to net-next-2.6 I should look into that inet_twsk_purge performance issue you mention when tearing down a namespace. It walks the entire hash table and takes a lock for every hash chain. Eric, is it possible for us to at least slightly optimize this by doing a peek at whether the head pointer of each chain is NULL and bypass the spinlock and everything else in that case? Or is this not legal with sk_nulls? Something like: if (hlist_nulls_empty(&head->twchain)) continue; right before the 'restart' label? --
I haven't had a chance to wrap my head around that case fully yet. After playing with a few ideas I think what we want to do algorithmically is to have a batched flush like we do for the routing table cache. That should get the cost down to only about 100ms. Which is much better when you have a lot of them but is still a lot of time. From my preliminary investigation I believe we don't need to take any locks to traverse the hash table. I think we can do the entire hash table traversal under simple rcu protection, and only take the lock on the individual time wait entries to delete them. .... Also for best batching we have ipip, ipgre, ip6_tunnel, sit, vlan, and bridging that need to be taught to use rtnl_link_ops and let the generic code delete their devices. The changes for the vlan code are simple. The rest I haven't finished wrapping my head around the drivers individual requirements for. Still the changes should not be sweeping. Eric --
