Re: [Lguest] [PATCH 4/5] lguest: use KVM hypercalls

Previous thread: Let Accai Berry imporve your health by Tanya on Tuesday, April 14, 2009 - 12:52 pm. (1 message)

Next thread: Acai Super Berry Capsules, you will love by Marcia Shafer on Tuesday, April 14, 2009 - 11:48 am. (1 message)
From: Herbert Xu
Date: Wednesday, April 15, 2009 - 1:36 am

Thanks for pointing out this gaping hole in my patch.  I think
it may be better to remove read_wait altogether and just use
socket.wait in tun_struct.

Let me whip up a patch.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Wednesday, April 15, 2009 - 1:47 am

tun: Fix sk_sleep races when attaching/detaching

As the sk_sleep wait queue actually lives in tfile, which may be
detached from the tun device, bad things will happen when we use
sk_sleep after detaching.

Since the tun device is the persistent data structure here (when
requested by the user), it makes much more sense to have the wait
queue live there.  There is no reason to have it in tfile at all
since the only time we can wait is if we have a tun attached.
In fact we already have a wait queue in tun_struct, so we might
as well use it.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a1b0697..412b578 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -93,7 +93,6 @@ struct tun_file {
 	atomic_t count;
 	struct tun_struct *tun;
 	struct net *net;
-	wait_queue_head_t	read_wait;
 };
 
 struct tun_sock;
@@ -333,7 +332,7 @@ static void tun_net_uninit(struct net_device *dev)
 	/* Inform the methods they need to stop using the dev.
 	 */
 	if (tfile) {
-		wake_up_all(&tfile->read_wait);
+		wake_up_all(&tun->socket.wait);
 		if (atomic_dec_and_test(&tfile->count))
 			__tun_detach(tun);
 	}
@@ -393,7 +392,7 @@ static int tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* Notify and wake up reader process */
 	if (tun->flags & TUN_FASYNC)
 		kill_fasync(&tun->fasync, SIGIO, POLL_IN);
-	wake_up_interruptible(&tun->tfile->read_wait);
+	wake_up_interruptible(&tun->socket.wait);
 	return 0;
 
 drop:
@@ -490,7 +489,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
 
 	DBG(KERN_INFO "%s: tun_chr_poll\n", tun->dev->name);
 
-	poll_wait(file, &tfile->read_wait, wait);
+	poll_wait(file, &tun->socket.wait, wait);
 
 	if (!skb_queue_empty(&tun->readq))
 		mask |= POLLIN | POLLRDNORM;
@@ -762,7 +761,7 ...
From: Christian Borntraeger
Date: Wednesday, April 15, 2009 - 2:07 am

Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
--

From: Patrick McHardy
Date: Wednesday, April 15, 2009 - 4:07 am

Tested and works fine, thanks Herbert.
--

From: Eric W. Biederman
Date: Wednesday, April 15, 2009 - 6:23 am

There is a GIGANTIC reason to have the wait queue on tfile.

If you open a file, and do ip link del tapN you can still
be blocked waiting in poll.

The problem is specifically free_poll_entry, where we call
remove_wait_queue and fput without calling any file methods.
So all of this happens without struct tun_file's count being
elevated.  Which means tun_net_uninit can detach before we get
off of the stupid poll wait queue.

As documented in:

commit b2430de37ef0bc0799ffba7b5219d38ca417eb76
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Tue Jan 20 11:03:21 2009 +0000

    tun: Move read_wait into tun_file
    
    The poll interface requires that the waitqueue exist while the struct
    file is open.  In the rare case when a tun device disappears before
    the tun file closes we fail to provide this property, so move
    read_wait.
    
    This is safe now that tun_net_xmit is atomic with tun_detach.
    
    Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>



I specifically moved the wait queue out of tun struct to avoid this
race.

I will see about getting the vfs to do something saner in my generic
revoke work.  But for 2.6.30 we have to live with the nasties that
are there.

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Eric
--

From: Herbert Xu
Date: Wednesday, April 15, 2009 - 6:28 am

What about taking a netdev refcount before calling poll_wait?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Eric W. Biederman
Date: Wednesday, April 15, 2009 - 6:35 am

Because as far as I can tell we would just leak that refcount.

The poll code does not appear to call back into any of the file
methods when it frees itself from the wait queue.

Eric
--

From: Herbert Xu
Date: Wednesday, April 15, 2009 - 6:46 am

OK my suggestion was stupid.

But I still don't see how this race is possible at all.

So process A has a tun fd open and is spinning in poll(2).  Now
process B comes along and deletes that tun device.  Process A's
fd should have a netdev reference that keeps the device and
associated structures alive.

Oh I see what's going on.  We're automatically detaching the
device in uninit.  This is just wrong.  Just because process B
deleted the netdev, process A should not be involutarily detached.

Does anything actually rely on this behaviour?

If not we should just change it to not do that.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Wednesday, April 15, 2009 - 6:55 am

It appears that this was introduced in

commit c70f182940f988448f3c12a209d18b1edc276e33
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Tue Jan 20 11:07:17 2009 +0000

    tun: Fix races between tun_net_close and free_netdev.

Presumably in order to fix the problem of trying to unregister
the same device twice.

I what we should do is to mark the device as dead instead of
detaching if a third party deletes it.  That's all you need
to know to stop close(2) from trying the unregister a device
that's already been unregistered.

What else am I missing?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Eric W. Biederman
Date: Wednesday, April 15, 2009 - 7:10 am

Hopefully my earlier explanation helps.  I will get back to
you as soon as I can.  But I am off on vacation for the
rest of the week.

Eric
--

From: Herbert Xu
Date: Wednesday, April 15, 2009 - 7:12 am

I'm sorry but I don't think we can wait for that.  We might just
have to fix it whatever way we can.  You can unfix it when you
come back :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Eric W. Biederman
Date: Wednesday, April 15, 2009 - 7:06 am

Yes.

There is the boring rmmod case that has always existed.

There is more interesting case of moving your tap device
into another network namespace.

In which case there is the possibility of the network namespace
exiting and destroying all of the virtual network devices before
we close the file handle.

I implemented the rtnetlink methods so we can test the uninit
behavior without going through all sorts of hoops.

Eric
--

From: Herbert Xu
Date: Wednesday, April 15, 2009 - 7:08 am

In that case what's the problem with holding a refcount to the
unregistered device until the process owning the fd closes it?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Eric W. Biederman
Date: Wednesday, April 15, 2009 - 7:18 am

Network devices do not hold a network namespace alive.  Only sockets
and processes do.

So holding the reference only blocks us indefinitely in
netdev_wait_allrefs, blocking the network namespace exit, and holding
net_mutex indefinitely.

My gut feel is that the socket needs to live in tun_file.  Instead
of in tun_struct.  Making that change looked just tricky enough
I couldn't sort through it when I glanced at the tun code, after I noticed
you had added a socket.

Eric
--

From: Herbert Xu
Date: Wednesday, April 15, 2009 - 7:23 am

Referring to tun_file to get sk_sleep is just too error-prone.
Unlike the transmit direction, the receive direction does not
present itself to the easy xmit lock solution that's currently
used to make tun_detach atomic.

The receive callback that currently uses sk_sleep can happen
anywhere.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Wednesday, April 15, 2009 - 7:38 am

OK that's a killer because process A in my previous scenario can
kill the device itself and cause a dead-lock.

So how about this? We replace the dev destructor with our own that
doesn't immediately call free_netdev.  We only call free_netdev once
all tun fd's attached to the device have been closed.

This can be achieved by simply adding a counter to tun_struct.
We'd also change the async detach path to set a marker instead
of detaching.  That marker can then be checked in places like
tun_get.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Eric W. Biederman
Date: Wednesday, April 15, 2009 - 7:56 am

That sounds like it would work, and allow us to have big tun_struct.
Which is sounds simpler overall.

I still have the feeling that putting the socket in tun_file instead
of in tun_struct might be conceptually cleaner, but one big blob that
is allocated and destroyed together is certainly easier and a lot
less racy to deal with.

Eric
--

From: Herbert Xu
Date: Wednesday, April 15, 2009 - 3:27 pm

As I said the difficulty with putting the socket in tun_file
is how do you get it on the RX callback path without introducing
new races.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Thursday, April 16, 2009 - 4:08 am

Here's the patch.  I'd appreciate if everyone can review it
and see if they can recreate the original race by

1) Starting a process using tun and polls on it;
2) Doing ip tun del tunXXX while the process is polling.

tun: Only free a netdev when all tun descriptors are closed

The commit c70f182940f988448f3c12a209d18b1edc276e33 ("tun: Fix
races between tun_net_close and free_netdev") fixed a race where
an asynchronous deletion of a tun device can hose a poll(2) on
a tun fd attached to that device.

However, this came at the cost of moving the tun wait queue into
the tun file data structure.  The problem with this is that it
imposes restrictions on when and where the tun device can access
the wait queue since the tun file may change at any time due to
detaching and reattaching.

In particular, now that we need to use the wait queue on the
receive path it becomes difficult to properly synchronise this
with the detachment of the tun device.

This patch solves the original race in a different way.  Since
the race is only because the underlying memory gets freed, we
can prevent it simply by ensuring that we don't do that until
all tun descriptors ever attached to the device (even if they
have since be detached because they may still be sitting in poll)
have been closed.

This is done by using reference counting the attached tun file
descriptors.  The refcount in tun->sk has been reappropriated
for this purpose since it was already being used for that, albeit
from the opposite angle.

Note that we no longer zero tfile->tun since tun_get will return
NULL anyway after the refcount on tfile hits zero.  Instead it
represents whether this device has ever been attached to a device.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a1b0697..540f829 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -156,6 +156,7 @@ static int tun_attach(struct tun_struct *tun, struct file *file)
 	tfile->tun = tun;
 ...
From: Herbert Xu
Date: Thursday, April 16, 2009 - 4:09 am

With that patch we can now safely move read_wait.

tun: Fix sk_sleep races when attaching/detaching

As the sk_sleep wait queue actually lives in tfile, which may be
detached from the tun device, bad things will happen when we use
sk_sleep after detaching.

Since the tun device is the persistent data structure here (when
requested by the user), it makes much more sense to have the wait
queue live there.  There is no reason to have it in tfile at all
since the only time we can wait is if we have a tun attached.
In fact we already have a wait queue in tun_struct, so we might
as well use it.

Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 540f829..7cfe3d1 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -93,7 +93,6 @@ struct tun_file {
 	atomic_t count;
 	struct tun_struct *tun;
 	struct net *net;
-	wait_queue_head_t	read_wait;
 };
 
 struct tun_sock;
@@ -331,7 +330,7 @@ static void tun_net_uninit(struct net_device *dev)
 	/* Inform the methods they need to stop using the dev.
 	 */
 	if (tfile) {
-		wake_up_all(&tfile->read_wait);
+		wake_up_all(&tun->socket.wait);
 		if (atomic_dec_and_test(&tfile->count))
 			__tun_detach(tun);
 	}
@@ -398,7 +397,7 @@ static int tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* Notify and wake up reader process */
 	if (tun->flags & TUN_FASYNC)
 		kill_fasync(&tun->fasync, SIGIO, POLL_IN);
-	wake_up_interruptible(&tun->tfile->read_wait);
+	wake_up_interruptible(&tun->socket.wait);
 	return 0;
 
 drop:
@@ -495,7 +494,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
 
 	DBG(KERN_INFO "%s: tun_chr_poll\n", tun->dev->name);
 
-	poll_wait(file, &tfile->read_wait, wait);
+	poll_wait(file, &tun->socket.wait, wait);
 
 	if (!skb_queue_empty(&tun->readq))
 ...
From: Herbert Xu
Date: Monday, April 20, 2009 - 1:35 am

That patch doesn't apply anymore because of contextual changes
caused by the first patch.  Here's an update.

tun: Fix sk_sleep races when attaching/detaching

As the sk_sleep wait queue actually lives in tfile, which may be
detached from the tun device, bad things will happen when we use
sk_sleep after detaching.

Since the tun device is the persistent data structure here (when
requested by the user), it makes much more sense to have the wait
queue live there.  There is no reason to have it in tfile at all
since the only time we can wait is if we have a tun attached.
In fact we already have a wait queue in tun_struct, so we might
as well use it.

Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 95ae40a..735bf41 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -93,7 +93,6 @@ struct tun_file {
 	atomic_t count;
 	struct tun_struct *tun;
 	struct net *net;
-	wait_queue_head_t	read_wait;
 };
 
 struct tun_sock;
@@ -331,7 +330,7 @@ static void tun_net_uninit(struct net_device *dev)
 	/* Inform the methods they need to stop using the dev.
 	 */
 	if (tfile) {
-		wake_up_all(&tfile->read_wait);
+		wake_up_all(&tun->socket.wait);
 		if (atomic_dec_and_test(&tfile->count))
 			__tun_detach(tun);
 	}
@@ -398,7 +397,7 @@ static int tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* Notify and wake up reader process */
 	if (tun->flags & TUN_FASYNC)
 		kill_fasync(&tun->fasync, SIGIO, POLL_IN);
-	wake_up_interruptible(&tun->tfile->read_wait);
+	wake_up_interruptible(&tun->socket.wait);
 	return 0;
 
 drop:
@@ -495,7 +494,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
 
 	DBG(KERN_INFO "%s: tun_chr_poll\n", tun->dev->name);
 
-	poll_wait(file, &tfile->read_wait, wait);
+	poll_wait(file, ...
From: David Miller
Date: Monday, April 20, 2009 - 2:26 am

From: Herbert Xu <herbert@gondor.apana.org.au>

Do you think these two patches are ready to go into net-2.6
now?

Thanks.
--

From: Herbert Xu
Date: Monday, April 20, 2009 - 2:35 am

I think so.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Monday, April 20, 2009 - 3:02 am

From: Herbert Xu <herbert@gondor.apana.org.au>

Great, applied, thanks Herbert.
--

From: Christian Borntraeger
Date: Friday, April 24, 2009 - 1:55 am

Sorry for the late reply, but this patch creates another regression:
Now TUNSETIFF returns EBUSY on a persistent tap device:

open("/dev/net/tun", O_RDWR) = 11
ioctl(11, 0x400454ca, 0x3ffff8e2270) = -1 EBUSY (Device or resource busy)

Some debugging (thanks to Carsten Otte) showed that

tun_set_iff calls tun_attach (the first call inside the if(dev)). tun_attach
now checks for tun->tfile but this is already set.
Looks like we need another fix on top :-(

Christian
--

From: Herbert Xu
Date: Friday, April 24, 2009 - 5:11 am

The patch you qouted has been superceded by many versions :)

Can you please test the latest that's in davem's tree?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

Previous thread: Let Accai Berry imporve your health by Tanya on Tuesday, April 14, 2009 - 12:52 pm. (1 message)

Next thread: Acai Super Berry Capsules, you will love by Marcia Shafer on Tuesday, April 14, 2009 - 11:48 am. (1 message)