Thanks for pointing out this gaping hole in my patch. I think it may be better to remove read_wait altogether and just use socket.wait in tun_struct. Let me whip up a patch. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
tun: Fix sk_sleep races when attaching/detaching
As the sk_sleep wait queue actually lives in tfile, which may be
detached from the tun device, bad things will happen when we use
sk_sleep after detaching.
Since the tun device is the persistent data structure here (when
requested by the user), it makes much more sense to have the wait
queue live there. There is no reason to have it in tfile at all
since the only time we can wait is if we have a tun attached.
In fact we already have a wait queue in tun_struct, so we might
as well use it.
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a1b0697..412b578 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -93,7 +93,6 @@ struct tun_file {
atomic_t count;
struct tun_struct *tun;
struct net *net;
- wait_queue_head_t read_wait;
};
struct tun_sock;
@@ -333,7 +332,7 @@ static void tun_net_uninit(struct net_device *dev)
/* Inform the methods they need to stop using the dev.
*/
if (tfile) {
- wake_up_all(&tfile->read_wait);
+ wake_up_all(&tun->socket.wait);
if (atomic_dec_and_test(&tfile->count))
__tun_detach(tun);
}
@@ -393,7 +392,7 @@ static int tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
/* Notify and wake up reader process */
if (tun->flags & TUN_FASYNC)
kill_fasync(&tun->fasync, SIGIO, POLL_IN);
- wake_up_interruptible(&tun->tfile->read_wait);
+ wake_up_interruptible(&tun->socket.wait);
return 0;
drop:
@@ -490,7 +489,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
DBG(KERN_INFO "%s: tun_chr_poll\n", tun->dev->name);
- poll_wait(file, &tfile->read_wait, wait);
+ poll_wait(file, &tun->socket.wait, wait);
if (!skb_queue_empty(&tun->readq))
mask |= POLLIN | POLLRDNORM;
@@ -762,7 +761,7 ...Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> --
Tested and works fine, thanks Herbert. --
There is a GIGANTIC reason to have the wait queue on tfile.
If you open a file, and do ip link del tapN you can still
be blocked waiting in poll.
The problem is specifically free_poll_entry, where we call
remove_wait_queue and fput without calling any file methods.
So all of this happens without struct tun_file's count being
elevated. Which means tun_net_uninit can detach before we get
off of the stupid poll wait queue.
As documented in:
commit b2430de37ef0bc0799ffba7b5219d38ca417eb76
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Tue Jan 20 11:03:21 2009 +0000
tun: Move read_wait into tun_file
The poll interface requires that the waitqueue exist while the struct
file is open. In the rare case when a tun device disappears before
the tun file closes we fail to provide this property, so move
read_wait.
This is safe now that tun_net_xmit is atomic with tun_detach.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I specifically moved the wait queue out of tun struct to avoid this
race.
I will see about getting the vfs to do something saner in my generic
revoke work. But for 2.6.30 we have to live with the nasties that
are there.
Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Eric
--
What about taking a netdev refcount before calling poll_wait? Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Because as far as I can tell we would just leak that refcount. The poll code does not appear to call back into any of the file methods when it frees itself from the wait queue. Eric --
OK my suggestion was stupid. But I still don't see how this race is possible at all. So process A has a tun fd open and is spinning in poll(2). Now process B comes along and deletes that tun device. Process A's fd should have a netdev reference that keeps the device and associated structures alive. Oh I see what's going on. We're automatically detaching the device in uninit. This is just wrong. Just because process B deleted the netdev, process A should not be involutarily detached. Does anything actually rely on this behaviour? If not we should just change it to not do that. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
It appears that this was introduced in
commit c70f182940f988448f3c12a209d18b1edc276e33
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Tue Jan 20 11:07:17 2009 +0000
tun: Fix races between tun_net_close and free_netdev.
Presumably in order to fix the problem of trying to unregister
the same device twice.
I what we should do is to mark the device as dead instead of
detaching if a third party deletes it. That's all you need
to know to stop close(2) from trying the unregister a device
that's already been unregistered.
What else am I missing?
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Hopefully my earlier explanation helps. I will get back to you as soon as I can. But I am off on vacation for the rest of the week. Eric --
I'm sorry but I don't think we can wait for that. We might just have to fix it whatever way we can. You can unfix it when you come back :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Yes. There is the boring rmmod case that has always existed. There is more interesting case of moving your tap device into another network namespace. In which case there is the possibility of the network namespace exiting and destroying all of the virtual network devices before we close the file handle. I implemented the rtnetlink methods so we can test the uninit behavior without going through all sorts of hoops. Eric --
In that case what's the problem with holding a refcount to the unregistered device until the process owning the fd closes it? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Network devices do not hold a network namespace alive. Only sockets and processes do. So holding the reference only blocks us indefinitely in netdev_wait_allrefs, blocking the network namespace exit, and holding net_mutex indefinitely. My gut feel is that the socket needs to live in tun_file. Instead of in tun_struct. Making that change looked just tricky enough I couldn't sort through it when I glanced at the tun code, after I noticed you had added a socket. Eric --
Referring to tun_file to get sk_sleep is just too error-prone. Unlike the transmit direction, the receive direction does not present itself to the easy xmit lock solution that's currently used to make tun_detach atomic. The receive callback that currently uses sk_sleep can happen anywhere. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
OK that's a killer because process A in my previous scenario can kill the device itself and cause a dead-lock. So how about this? We replace the dev destructor with our own that doesn't immediately call free_netdev. We only call free_netdev once all tun fd's attached to the device have been closed. This can be achieved by simply adding a counter to tun_struct. We'd also change the async detach path to set a marker instead of detaching. That marker can then be checked in places like tun_get. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
That sounds like it would work, and allow us to have big tun_struct. Which is sounds simpler overall. I still have the feeling that putting the socket in tun_file instead of in tun_struct might be conceptually cleaner, but one big blob that is allocated and destroyed together is certainly easier and a lot less racy to deal with. Eric --
As I said the difficulty with putting the socket in tun_file is how do you get it on the RX callback path without introducing new races. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Here's the patch. I'd appreciate if everyone can review it and see if they can recreate the original race by 1) Starting a process using tun and polls on it; 2) Doing ip tun del tunXXX while the process is polling. tun: Only free a netdev when all tun descriptors are closed The commit c70f182940f988448f3c12a209d18b1edc276e33 ("tun: Fix races between tun_net_close and free_netdev") fixed a race where an asynchronous deletion of a tun device can hose a poll(2) on a tun fd attached to that device. However, this came at the cost of moving the tun wait queue into the tun file data structure. The problem with this is that it imposes restrictions on when and where the tun device can access the wait queue since the tun file may change at any time due to detaching and reattaching. In particular, now that we need to use the wait queue on the receive path it becomes difficult to properly synchronise this with the detachment of the tun device. This patch solves the original race in a different way. Since the race is only because the underlying memory gets freed, we can prevent it simply by ensuring that we don't do that until all tun descriptors ever attached to the device (even if they have since be detached because they may still be sitting in poll) have been closed. This is done by using reference counting the attached tun file descriptors. The refcount in tun->sk has been reappropriated for this purpose since it was already being used for that, albeit from the opposite angle. Note that we no longer zero tfile->tun since tun_get will return NULL anyway after the refcount on tfile hits zero. Instead it represents whether this device has ever been attached to a device. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> diff --git a/drivers/net/tun.c b/drivers/net/tun.c index a1b0697..540f829 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -156,6 +156,7 @@ static int tun_attach(struct tun_struct *tun, struct file *file) tfile->tun = tun; ...
With that patch we can now safely move read_wait.
tun: Fix sk_sleep races when attaching/detaching
As the sk_sleep wait queue actually lives in tfile, which may be
detached from the tun device, bad things will happen when we use
sk_sleep after detaching.
Since the tun device is the persistent data structure here (when
requested by the user), it makes much more sense to have the wait
queue live there. There is no reason to have it in tfile at all
since the only time we can wait is if we have a tun attached.
In fact we already have a wait queue in tun_struct, so we might
as well use it.
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 540f829..7cfe3d1 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -93,7 +93,6 @@ struct tun_file {
atomic_t count;
struct tun_struct *tun;
struct net *net;
- wait_queue_head_t read_wait;
};
struct tun_sock;
@@ -331,7 +330,7 @@ static void tun_net_uninit(struct net_device *dev)
/* Inform the methods they need to stop using the dev.
*/
if (tfile) {
- wake_up_all(&tfile->read_wait);
+ wake_up_all(&tun->socket.wait);
if (atomic_dec_and_test(&tfile->count))
__tun_detach(tun);
}
@@ -398,7 +397,7 @@ static int tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
/* Notify and wake up reader process */
if (tun->flags & TUN_FASYNC)
kill_fasync(&tun->fasync, SIGIO, POLL_IN);
- wake_up_interruptible(&tun->tfile->read_wait);
+ wake_up_interruptible(&tun->socket.wait);
return 0;
drop:
@@ -495,7 +494,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
DBG(KERN_INFO "%s: tun_chr_poll\n", tun->dev->name);
- poll_wait(file, &tfile->read_wait, wait);
+ poll_wait(file, &tun->socket.wait, wait);
if (!skb_queue_empty(&tun->readq))
...That patch doesn't apply anymore because of contextual changes
caused by the first patch. Here's an update.
tun: Fix sk_sleep races when attaching/detaching
As the sk_sleep wait queue actually lives in tfile, which may be
detached from the tun device, bad things will happen when we use
sk_sleep after detaching.
Since the tun device is the persistent data structure here (when
requested by the user), it makes much more sense to have the wait
queue live there. There is no reason to have it in tfile at all
since the only time we can wait is if we have a tun attached.
In fact we already have a wait queue in tun_struct, so we might
as well use it.
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 95ae40a..735bf41 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -93,7 +93,6 @@ struct tun_file {
atomic_t count;
struct tun_struct *tun;
struct net *net;
- wait_queue_head_t read_wait;
};
struct tun_sock;
@@ -331,7 +330,7 @@ static void tun_net_uninit(struct net_device *dev)
/* Inform the methods they need to stop using the dev.
*/
if (tfile) {
- wake_up_all(&tfile->read_wait);
+ wake_up_all(&tun->socket.wait);
if (atomic_dec_and_test(&tfile->count))
__tun_detach(tun);
}
@@ -398,7 +397,7 @@ static int tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
/* Notify and wake up reader process */
if (tun->flags & TUN_FASYNC)
kill_fasync(&tun->fasync, SIGIO, POLL_IN);
- wake_up_interruptible(&tun->tfile->read_wait);
+ wake_up_interruptible(&tun->socket.wait);
return 0;
drop:
@@ -495,7 +494,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
DBG(KERN_INFO "%s: tun_chr_poll\n", tun->dev->name);
- poll_wait(file, &tfile->read_wait, wait);
+ poll_wait(file, ...From: Herbert Xu <herbert@gondor.apana.org.au> Do you think these two patches are ready to go into net-2.6 now? Thanks. --
I think so. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> Great, applied, thanks Herbert. --
Sorry for the late reply, but this patch creates another regression: Now TUNSETIFF returns EBUSY on a persistent tap device: open("/dev/net/tun", O_RDWR) = 11 ioctl(11, 0x400454ca, 0x3ffff8e2270) = -1 EBUSY (Device or resource busy) Some debugging (thanks to Carsten Otte) showed that tun_set_iff calls tun_attach (the first call inside the if(dev)). tun_attach now checks for tun->tfile but this is already set. Looks like we need another fix on top :-( Christian --
The patch you qouted has been superceded by many versions :) Can you please test the latest that's in davem's tree? Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Done. With http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fdavem%2Fnet-2.6.git;a=commitdiff_plain... http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fdavem%2Fnet-2.6.git;a=commitdiff_plain... Everything works fine. Thank you and sorry for the noise. Christian --
