Ben Greear reported sporadically hanging ip processes when adding or removing MACVLAN devices, which seem to be caused by a race between sysctl handling and device notifiers. What is happening is: Process 1: - "ip" deletes a macvlan device (or any other kind of device gets removed for whatever reason) - netdev notifier chain notifies IPv6 addrconf while holding the RTNL Process 2: - a different process writes something to /proc/sys/net/ipv6/forwarding - sysctl table is marked as busy, addrconf_sysctl_forward() is invoked - tries to take rtnl, waits for process 1 Process 1: - IPv6 begins sysctl unregistration, which is deferred using a completion until the table is not busy anymore (=> waiting for process 2) At this point both processes are deadlocked. Judging by a quick look, IPv4 seems to have the exact same problem. An easy fix would be to keep track of whether sysctl unregistration is in progress in IPv4/IPv6 and ignore new requests from that point on. Its not very elegant though, so I was wondering whether anyone has a better suggestion. --
We could make the unregistration asynchronous and invoke a callback when it's done. Then we can simply hold a net_device refcount and relinquish it in the callback. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
That sounds simple enough. I'll see if I can come up with a patch, thanks. --
Unfortunately its more complicated than I thought because of device renames, where the sysctl pointer is reused after unregistration and the rename/unregistration/re-registration should be atomic. Deferring unregistration means we can't perform the new registration immediately unless we allow multiple registrations for a single device to be active simulaneously, which introduces a whole new set of problems. Simply ignoring the request during unregistration doesn't seem so bad after all, the main problem is that it intoduces a different race on renames where a write to the "forwarding" file returns success, but the change doesn't take effect. We could return -ENOENT, but that seems a bit strange after open() returned success. Maybe -EBUSY, although I would prefer to make this transparent to userspace. Another alternative would be to simply not take the RTNL in the sysctl handler since we're already taking dev_base_lock before performing any forwaring changes. But in case of IPv4 we need it for disabling LRO. I think I'm stuck. Will rethink it after some coffee :) --
I'd like to avoid that for the rename case just because shell scripts know how to deal with echo foo > /nonexist/file but not Yes we need more coffee :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
How does adding a rename operation to sysctl sound? I am a little concerned that if we have this issue with sysctl we also have it with proc and sysfs as well. Although I admit I don't understand it yet. Eric --
Yes that would definitely help. Of course for the unregister case we'd still need either an async removal or a no-op as Patrick Indeed. /proc should be OK because it's all read-only so we don't take the RTNL anywhere. But sysfs may well have the same issue. I think we should watch out for any more attempts at adding netdev name-based sysfs nodes and nip them in the bud. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
After having reread the thread and looking at the code I think I understand what is happening. sysctl, proc, and sysfs all need to wait until there are no more users before their unregister operation succeeds. So that we can guarantee that it is safe to remove a module that provides the callback function. Currently ndo_stop, NETDEV_DOWN, unlist_netdevice and I don't know how much other code is run from unregister_netdevice with the rtnl lock. If we do an asynchronous unregister we need to ensure that entire code path is safe without rtnl_lock. And we would need to run the unregister work from rtnl_lock. Ugh. netdev_store() and a few other functions in net-sysfs.c take rtnl_lock. The instance in netdev_store appears to date back to 21 May 2003 sometime during 2.5. So this is an old problem that we are just noticing now. Ugh. Currently rtnl_lock() protects the netdevice_notifier_chain. So it appears we need to hold rtnl_lock(). Which leads me to conclude either we need to completely rewrite the locking rules for the networking stack, or we need to teach the sysfs, sysctl, and proc how to grab a subsystem lock around a callback. We already do this for netlink with netlink_create_kernel. So I guess we need a variants of: register_sysctl_table, proc_create, and class_create_file. What a pain, but at least it looks like it can work. Eric --
On Wed, 25 Feb 2009 23:18:42 -0800
What about something like this:
Subject: [PATCH] Avoid race between network down and sysfs
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
--- a/net/core/net-sysfs.c 2009-02-26 08:36:18.000000000 -0800
+++ b/net/core/net-sysfs.c 2009-02-26 08:37:51.000000000 -0800
@@ -77,7 +77,9 @@ static ssize_t netdev_store(struct devic
if (endp == buf)
goto err;
- rtnl_lock();
+ if (!rtnl_trylock())
+ return -ERESTARTSYS;
+
if (dev_isalive(net)) {
if ((ret = (*set)(net, new)) == 0)
ret = len;
--
As far as solutions go. That looks like the easiest correct solution. So. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Will -ERESTARTSYS trigger the in kernel restart logic in this case? There are a lot more cases to cover, and I don't I like it long term. Spinning waiting for rtnl_lock feels wrong. Plus it does not help with discovering the problem in new sysfs, sysctl, or proc files. --
On Thu, 26 Feb 2009 11:01:41 -0800 I haven't tested it, but it should restart in VFS. Spinning is not that big a deal, and it also handles the case where the name changed or some other race occurred during processing. When syscall is re-entered, the name will no longer be found and -ENOENT will be returned. --
This is going to spin instead of sleep on contention, is that intended? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
On Fri, 27 Feb 2009 08:59:45 +0800 It walks all the way back out to VFS, which is what you have to do since the sysctl may move. --
I can test this to see if it fixes my problem. Are the above lines the entirety of the patch? I think I'll add a printk if we do the ERESTARTSYS to make sure that I hit the race (but still recover). Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com --
On Fri, 27 Feb 2009 10:26:43 -0800 yes --
We should be able to avoid the restart looping in most cases, we only need to do this while unregistration is in progress. --
It doesn't seem to work. The idea was something like this:
static int addrconf_fixup_forwarding(struct ctl_table *table, int *p,
int old)
{
struct inet6_dev *idev;
struct net *net;
net = (struct net *)table->extra2;
if (p == &net->ipv6.devconf_dflt->forwarding)
return 0;
/* Unregistration in progress */
idev = table->extra1;
if (idev->cnf.sysctl == NULL)
return -ERESTARTSYS;
rtnl_lock();
...
but the process might have entered this function while the rtnl
is already held, but the unregistration hasn't been triggered yet.
So it would still deadlock.
--
With both of Stephen's patches included in the latest -rc6 source, I re-ran the test and it seems to be working (I added printks so that I would know the new code was being exercised) I had 2000 or so mac-vlans configured, with 10 of them being re-configured concurrently, while also deleting groups of 20-100 mac-vlans in my test. This was locking up reliably before, and now it seems to be working fine. Here's the kernel log showing the ERESTARTSYS in action. I don't have an easy way to check to see if the VFS (or whatever) retried the call properly, but will let you all know if I see any indication that isn't working. I only saw the ipv6 fixup in my logs, but maybe my test case just doesn't hit the other... Mar 2 13:37:41 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock. Mar 2 13:37:41 simech-1 kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready Mar 2 13:37:42 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock. Mar 2 13:38:13 simech-1 kernel:last message repeated 42 times Mar 2 13:39:20 simech-1 kernel:last message repeated 71 times Mar 2 13:40:13 simech-1 kernel:last message repeated 68 times Mar 2 13:40:13 simech-1 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 8 Mar 2 13:40:13 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock. Mar 2 13:40:21 simech-1 kernel:last message repeated 11 times Mar 2 13:40:21 simech-1 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 13 Mar 2 13:40:21 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock. Mar 2 13:40:28 simech-1 kernel:last message repeated 16 times Mar 2 13:40:28 simech-1 kernel: __ratelimit: 3352 callbacks suppressed Mar 2 13:40:28 simech-1 kernel: Neighbour table overflow. Mar 2 13:40:28 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock. Mar 2 13:40:34 simech-1 ...
This looks like its working fine. Despite the non-desirable active spinning, this seems like the best fix (actually much simpler than I expected to be possible) at this time. If we just could avoid the spinning when unnecessary, it would be perfect :) --
From: Patrick McHardy <kaber@trash.net> Could you give that "not actually in-progress" detection a shot? I don't like the spinning either. --
I tried this morning, the problem is that its always the sysctl handler which will run into the deadlock first, but there is no reliable indication to avoid it other than that the RTNL is already held. The problem is that the sysctl interface puts the process holding the RTNL to sleep and allows a process requiring it to run. Any different synchronization attempt will have the same problem, it seems you simply can't hold any locks while unregistering sysctls. --
From: Patrick McHardy <kaber@trash.net> Ok, I applied Stephen's patches then... --
From the further brainstorming department... It appears we are using the wait for the completion for two distinct purposes. Waiting until it is safe to free storage. Blocking module unregistration until the all of the users of the code are gone. I'm wondering if we could move the memory freeing. And consolidate the waiting ultimately moving the wait for completion into netdev_run_todo after we drop the lock. The reason I care is that it looks like to get hotplug handling sane, I'm going to need to implement a generic version of what sysfs, proc and sysctl are doing, and I expect that generic version belongs in the vfs ultimately giving us the capability to implement revoke. So I am going to be revisiting how the code works, and if I can come up with a cleaner solution to the networking stack it would be a good time to implement it. Eric --
On Wed, 25 Feb 2009 08:18:47 +0100
Will the following help? It punts the problem back out to VFS which
will restart.
--- a/net/ipv6/addrconf.c 2009-02-26 08:51:09.000000000 -0800
+++ b/net/ipv6/addrconf.c 2009-02-26 08:54:08.000000000 -0800
@@ -493,15 +493,17 @@ static void addrconf_forward_change(stru
read_unlock(&dev_base_lock);
}
-static void addrconf_fixup_forwarding(struct ctl_table *table, int *p, int old)
+static int addrconf_fixup_forwarding(struct ctl_table *table, int *p, int old)
{
struct net *net;
net = (struct net *)table->extra2;
if (p == &net->ipv6.devconf_dflt->forwarding)
- return;
+ return 0;
+
+ if (!rtnl_trylock())
+ return -ERESTARTSYS;
- rtnl_lock();
if (p == &net->ipv6.devconf_all->forwarding) {
__s32 newf = net->ipv6.devconf_all->forwarding;
net->ipv6.devconf_dflt->forwarding = newf;
@@ -512,6 +514,7 @@ static void addrconf_fixup_forwarding(st
if (*p)
rt6_purge_dflt_routers(net);
+ return 1;
}
#endif
@@ -3977,7 +3980,7 @@ int addrconf_sysctl_forward(ctl_table *c
ret = proc_dointvec(ctl, write, filp, buffer, lenp, ppos);
if (write)
- addrconf_fixup_forwarding(ctl, valp, val);
+ ret = addrconf_fixup_forwarding(ctl, valp, val);
return ret;
}
@@ -4013,8 +4016,7 @@ static int addrconf_sysctl_forward_strat
}
*valp = new;
- addrconf_fixup_forwarding(table, valp, val);
- return 1;
+ return addrconf_fixup_forwarding(table, valp, val);
}
static struct addrconf_sysctl_table
--
