Re: IPv4/IPv6 sysctl unregistration deadlock

Previous thread: [PATCH] zaurus: add usb id for motomagx phones by Dmitriy Taychenachev on Tuesday, February 24, 2009 - 9:42 pm. (1 message)

Next thread: Documentation:Update the INDEX of the documents by Yang Hongyang on Wednesday, February 25, 2009 - 12:16 am. (12 messages)
From: Patrick McHardy
Date: Tuesday, February 24, 2009 - 10:23 pm

Ben Greear reported sporadically hanging ip processes when adding or
removing MACVLAN devices, which seem to be caused by a race between
sysctl handling and device notifiers.

What is happening is:

Process 1:

- "ip" deletes a macvlan device (or any other kind of device gets
   removed for whatever reason)
- netdev notifier chain notifies IPv6 addrconf while holding the RTNL

Process 2:

- a different process writes something to /proc/sys/net/ipv6/forwarding
- sysctl table is marked as busy, addrconf_sysctl_forward() is invoked
- tries to take rtnl, waits for process 1

Process 1:

- IPv6 begins sysctl unregistration, which is deferred using a
   completion until the table is not busy anymore (=> waiting for
   process 2)

At this point both processes are deadlocked. Judging by a quick look,
IPv4 seems to have the exact same problem.

An easy fix would be to keep track of whether sysctl unregistration
is in progress in IPv4/IPv6 and ignore new requests from that point
on. Its not very elegant though, so I was wondering whether anyone
has a better suggestion.

--

From: Herbert Xu
Date: Tuesday, February 24, 2009 - 11:19 pm

We could make the unregistration asynchronous and invoke a callback
when it's done.  Then we can simply hold a net_device refcount and
relinquish it in the callback.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Patrick McHardy
Date: Tuesday, February 24, 2009 - 11:23 pm

That sounds simple enough. I'll see if I can come up with a patch, thanks.
--

From: Patrick McHardy
Date: Wednesday, February 25, 2009 - 12:18 am

Unfortunately its more complicated than I thought because of
device renames, where the sysctl pointer is reused after
unregistration and the rename/unregistration/re-registration
should be atomic. Deferring unregistration means we can't perform
the new registration immediately unless we allow multiple
registrations for a single device to be active simulaneously,
which introduces a whole new set of problems.

Simply ignoring the request during unregistration doesn't seem
so bad after all, the main problem is that it intoduces a different
race on renames where a write to the "forwarding" file returns
success, but the change doesn't take effect. We could return
-ENOENT, but that seems a bit strange after open() returned success.
Maybe -EBUSY, although I would prefer to make this transparent
to userspace.

Another alternative would be to simply not take the RTNL in
the sysctl handler since we're already taking dev_base_lock
before performing any forwaring changes. But in case of IPv4
we need it for disabling LRO.

I think I'm stuck. Will rethink it after some coffee :)

--

From: Herbert Xu
Date: Wednesday, February 25, 2009 - 1:43 am

I'd like to avoid that for the rename case just because shell
scripts know how to deal with echo foo > /nonexist/file but not

Yes we need more coffee :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Eric W. Biederman
Date: Wednesday, February 25, 2009 - 11:06 pm

From: Eric W. Biederman
Date: Wednesday, February 25, 2009 - 11:10 pm

How does adding a rename operation to sysctl sound?

I am a little concerned that if we have this issue with sysctl
we also have it with proc and sysfs as well.

Although I admit I don't understand it yet.

Eric


--

From: Herbert Xu
Date: Wednesday, February 25, 2009 - 11:22 pm

Yes that would definitely help.  Of course for the unregister case
we'd still need either an async removal or a no-op as Patrick

Indeed.  /proc should be OK because it's all read-only so we don't
take the RTNL anywhere.  But sysfs may well have the same issue.

I think we should watch out for any more attempts at adding
netdev name-based sysfs nodes and nip them in the bud.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Eric W. Biederman
Date: Thursday, February 26, 2009 - 12:18 am

After having reread the thread and looking at the code I think
I understand what is happening.

sysctl, proc, and sysfs all need to wait until there are no
more users before their unregister operation succeeds.  So that
we can guarantee that it is safe to remove a module that provides
the callback function.

Currently ndo_stop, NETDEV_DOWN, unlist_netdevice and I don't
know how much other code is run from unregister_netdevice
with the rtnl lock.  If we do an asynchronous unregister
we need to ensure that entire code path is safe without
rtnl_lock.  And we would need to run the unregister work
from rtnl_lock.

Ugh.  netdev_store() and a few other functions in net-sysfs.c
take rtnl_lock.  The instance in netdev_store appears to date
back to 21 May 2003 sometime during 2.5.

So this is an old problem that we are just noticing now. Ugh.

Currently rtnl_lock() protects the netdevice_notifier_chain.
So it appears we need to hold rtnl_lock().

Which leads me to conclude either we need to completely rewrite the
locking rules for the networking stack, or we need to teach the sysfs,
sysctl, and proc how to grab a subsystem lock around a callback.

We already do this for netlink with netlink_create_kernel.

So I guess we need a variants of:
register_sysctl_table, proc_create, and class_create_file.

What a pain, but at least it looks like it can work.

Eric
--

From: Stephen Hemminger
Date: Thursday, February 26, 2009 - 9:49 am

On Wed, 25 Feb 2009 23:18:42 -0800

What about something like this:

Subject: [PATCH] Avoid race between network down and sysfs


Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/net/core/net-sysfs.c	2009-02-26 08:36:18.000000000 -0800
+++ b/net/core/net-sysfs.c	2009-02-26 08:37:51.000000000 -0800
@@ -77,7 +77,9 @@ static ssize_t netdev_store(struct devic
 	if (endp == buf)
 		goto err;
 
-	rtnl_lock();
+	if (!rtnl_trylock())
+		return -ERESTARTSYS;
+
 	if (dev_isalive(net)) {
 		if ((ret = (*set)(net, new)) == 0)
 			ret = len;
--

From: Eric W. Biederman
Date: Thursday, February 26, 2009 - 12:01 pm

As far as solutions go.  That looks like the easiest correct solution.
So.  
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>


Will -ERESTARTSYS trigger the in kernel restart logic in this case?

There are a lot more cases to cover, and I don't I like it long
term.  Spinning waiting for rtnl_lock feels wrong.  Plus it does
not help with discovering the problem in new sysfs, sysctl, or
proc files.

--

From: Stephen Hemminger
Date: Thursday, February 26, 2009 - 1:24 pm

On Thu, 26 Feb 2009 11:01:41 -0800

I haven't tested it, but it should restart in VFS.
Spinning is not that big a deal, and it also handles the case
where the name changed or some other race occurred during processing.
When syscall is re-entered, the name will no longer be found and
-ENOENT will be returned.
--

From: Herbert Xu
Date: Thursday, February 26, 2009 - 5:59 pm

This is going to spin instead of sleep on contention, is that
intended?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Stephen Hemminger
Date: Thursday, February 26, 2009 - 6:25 pm

On Fri, 27 Feb 2009 08:59:45 +0800

It walks all the way back out to VFS, which is what you have to
do since the sysctl may move.
--

From: Ben Greear
Date: Friday, February 27, 2009 - 11:26 am

I can test this to see if it fixes my problem.  Are the above lines the
entirety of the patch?

I think I'll add a printk if we do the ERESTARTSYS to make sure that I
hit the race (but still recover).

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

--

From: Stephen Hemminger
Date: Friday, February 27, 2009 - 11:38 am

On Fri, 27 Feb 2009 10:26:43 -0800

yes
--

From: Patrick McHardy
Date: Monday, March 2, 2009 - 4:07 am

We should be able to avoid the restart looping in most cases,
we only need to do this while unregistration is in progress.

--

From: Patrick McHardy
Date: Monday, March 2, 2009 - 4:21 am

It doesn't seem to work. The idea was something like this:

static int addrconf_fixup_forwarding(struct ctl_table *table, int *p, 
int old)
{
	struct inet6_dev *idev;
	struct net *net;

	net = (struct net *)table->extra2;
	if (p == &net->ipv6.devconf_dflt->forwarding)
		return 0;

	/* Unregistration in progress */
	idev = table->extra1;
	if (idev->cnf.sysctl == NULL)
		return -ERESTARTSYS;

	rtnl_lock();
...

but the process might have entered this function while the rtnl
is already held, but the unregistration hasn't been triggered yet.
So it would still deadlock.
--

From: Ben Greear
Date: Monday, March 2, 2009 - 3:11 pm

With both of Stephen's patches included in the latest -rc6 source,
I re-ran the test and it seems to be working (I added printks so
that I would know the new code was being exercised)

I had 2000 or so mac-vlans configured, with 10 of them being
re-configured concurrently, while also deleting groups of 20-100
mac-vlans in my test.  This was locking up reliably before,
and now it seems to be working fine.

Here's the kernel log showing the ERESTARTSYS in action.  I don't
have an easy way to check to see if the VFS (or whatever) retried
the call properly, but will let you all know if I see any indication
that isn't working.

I only saw the ipv6 fixup in my logs, but maybe my test case just
doesn't hit the other...

Mar  2 13:37:41 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:37:41 simech-1 kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Mar  2 13:37:42 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:38:13 simech-1 kernel:last message repeated 42 times
Mar  2 13:39:20 simech-1 kernel:last message repeated 71 times
Mar  2 13:40:13 simech-1 kernel:last message repeated 68 times
Mar  2 13:40:13 simech-1 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 8
Mar  2 13:40:13 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:40:21 simech-1 kernel:last message repeated 11 times
Mar  2 13:40:21 simech-1 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 13
Mar  2 13:40:21 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:40:28 simech-1 kernel:last message repeated 16 times
Mar  2 13:40:28 simech-1 kernel: __ratelimit: 3352 callbacks suppressed
Mar  2 13:40:28 simech-1 kernel: Neighbour table overflow.
Mar  2 13:40:28 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:40:34 simech-1 ...
From: Patrick McHardy
Date: Monday, March 2, 2009 - 3:20 pm

This looks like its working fine. Despite the non-desirable active
spinning, this seems like the best fix (actually much simpler than
I expected to be possible) at this time. If we just could avoid
the spinning when unnecessary, it would be perfect :)
--

From: David Miller
Date: Monday, March 2, 2009 - 3:47 pm

From: Patrick McHardy <kaber@trash.net>

Could you give that "not actually in-progress" detection a shot?

I don't like the spinning either.
--

From: Patrick McHardy
Date: Monday, March 2, 2009 - 4:03 pm

I tried this morning, the problem is that its always the sysctl
handler which will run into the deadlock first, but there is no
reliable indication to avoid it other than that the RTNL is
already held. The problem is that the sysctl interface puts the
process holding the RTNL to sleep and allows a process requiring
it to run. Any different synchronization attempt will have the
same problem, it seems you simply can't hold any locks while
unregistering sysctls.
--

From: David Miller
Date: Tuesday, March 3, 2009 - 1:48 am

From: Patrick McHardy <kaber@trash.net>

Ok, I applied Stephen's patches then...
--

From: Eric W. Biederman
Date: Saturday, March 7, 2009 - 8:36 pm

From the further brainstorming department...

It appears we are using the wait for the completion
for two distinct purposes.

Waiting until it is safe to free storage.

Blocking module unregistration until the all of the
users of the code are gone.

I'm wondering if we could move the memory freeing.

And consolidate the waiting ultimately moving the
wait for completion into netdev_run_todo after
we drop the lock.

The reason I care is that it looks like to get hotplug
handling sane, I'm going to need to implement a generic
version of what sysfs, proc and sysctl are doing, and I
expect that generic version belongs in the vfs ultimately
giving us the capability to implement revoke.

So I am going to be revisiting how the code works, and if
I can come up with a cleaner solution to the networking
stack it would be a good time to implement it.

Eric
--

From: Stephen Hemminger
Date: Thursday, February 26, 2009 - 9:55 am

On Wed, 25 Feb 2009 08:18:47 +0100

Will the following help? It punts the problem back out to VFS which
will restart.

--- a/net/ipv6/addrconf.c	2009-02-26 08:51:09.000000000 -0800
+++ b/net/ipv6/addrconf.c	2009-02-26 08:54:08.000000000 -0800
@@ -493,15 +493,17 @@ static void addrconf_forward_change(stru
 	read_unlock(&dev_base_lock);
 }
 
-static void addrconf_fixup_forwarding(struct ctl_table *table, int *p, int old)
+static int addrconf_fixup_forwarding(struct ctl_table *table, int *p, int old)
 {
 	struct net *net;
 
 	net = (struct net *)table->extra2;
 	if (p == &net->ipv6.devconf_dflt->forwarding)
-		return;
+		return 0;
+
+	if (!rtnl_trylock())
+		return -ERESTARTSYS;
 
-	rtnl_lock();
 	if (p == &net->ipv6.devconf_all->forwarding) {
 		__s32 newf = net->ipv6.devconf_all->forwarding;
 		net->ipv6.devconf_dflt->forwarding = newf;
@@ -512,6 +514,7 @@ static void addrconf_fixup_forwarding(st
 
 	if (*p)
 		rt6_purge_dflt_routers(net);
+	return 1;
 }
 #endif
 
@@ -3977,7 +3980,7 @@ int addrconf_sysctl_forward(ctl_table *c
 	ret = proc_dointvec(ctl, write, filp, buffer, lenp, ppos);
 
 	if (write)
-		addrconf_fixup_forwarding(ctl, valp, val);
+		ret = addrconf_fixup_forwarding(ctl, valp, val);
 	return ret;
 }
 
@@ -4013,8 +4016,7 @@ static int addrconf_sysctl_forward_strat
 	}
 
 	*valp = new;
-	addrconf_fixup_forwarding(table, valp, val);
-	return 1;
+	return addrconf_fixup_forwarding(table, valp, val);
 }
 
 static struct addrconf_sysctl_table
--

Previous thread: [PATCH] zaurus: add usb id for motomagx phones by Dmitriy Taychenachev on Tuesday, February 24, 2009 - 9:42 pm. (1 message)

Next thread: Documentation:Update the INDEX of the documents by Yang Hongyang on Wednesday, February 25, 2009 - 12:16 am. (12 messages)