Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

Previous thread: [PATCHv7 1/3] tun: export underlying socket by Michael S. Tsirkin on Tuesday, November 3, 2009 - 10:24 am. (2 messages)

Next thread: 2.6.32-rc5-mmotm1101 - kernel BUG at net/ipv4/tcp_input.c:3707! by Valdis.Kletnieks on Tuesday, November 3, 2009 - 10:50 am. (17 messages)
From: Michael S. Tsirkin
Date: Tuesday, November 3, 2009 - 10:24 am

What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
  migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)

common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear.  I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device.  Backend is also configured by userspace, including vlan/mac
etc.

Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.

Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use

Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 MAINTAINERS                |    9 +
 arch/x86/kvm/Kconfig       |    1 +
 drivers/Makefile           |    1 +
 drivers/vhost/Kconfig      |   11 +
 drivers/vhost/Makefile     |    2 +
 drivers/vhost/net.c        |  633 +++++++++++++++++++++++++++++
 drivers/vhost/vhost.c      |  970 ++++++++++++++++++++++++++++++++++++++++++++
 ...
From: Eric Dumazet
Date: Tuesday, November 3, 2009 - 11:03 am

using rcu_dereference() and mutex_lock() at the same time seems wrong, I suspect
that your use of RCU is not correct.

1) rcu_dereference() should be done inside a read_rcu_lock() section, and
   we are not allowed to sleep in such a section.
   (Quoting Documentation/RCU/whatisRCU.txt :
     It is illegal to block while in an RCU read-side critical section, )

2) mutex_lock() can sleep (ie block)

--

From: Gregory Haskins
Date: Tuesday, November 3, 2009 - 11:08 am

Michael,
  I warned you that this needed better documentation ;)

Eric,
  I think I flagged this once before, but Michael convinced me that it
was indeed "ok", if but perhaps a bit unconventional.  I will try to
find the thread.

Kind Regards,
-Greg

From: Gregory Haskins
Date: Tuesday, November 3, 2009 - 11:14 am

Here it is:

http://lkml.org/lkml/2009/8/12/173

Kind Regards,
-Greg

From: Eric Dumazet
Date: Tuesday, November 3, 2009 - 11:51 am

Yes, this doesnt convince me at all, and could be a precedent for a wrong RCU use.
People wanting to use RCU do a grep on kernel sources to find how to correctly
use RCU.

Michael, please use existing locking/barrier mechanisms, and not pretend to use RCU.

Some automatic tools might barf later.

For example, we could add a debugging facility to check that rcu_dereference() is used
in an appropriate context, ie conflict with existing mutex_lock() debugging facility.


--

From: Gregory Haskins
Date: Tuesday, November 3, 2009 - 12:50 pm

Yes, I would tend to agree with you.  In fact, I think I suggested that
a normal barrier should be used instead of abusing rcu_dereference().

But as far as his code is concerned, I think it technically works
properly, and that was my main point.  Also note that the usage
rcu_dereference+mutex_lock() are not necessarily broken, per se:  it
could be an srcu-based critical section created by the caller, for
instance.  It would be perfectly legal to sleep on the mutex if that
were the case.

To me, the bigger issue is that the rcu_dereference() without any
apparent hint of a corresponding RSCS is simply confusing as a reviewer.
 smp_rmb() (or whatever is proper in this case) is probably more
appropriate.

Kind Regards,
-Greg


From: Michael S. Tsirkin
Date: Tuesday, November 3, 2009 - 12:58 pm

Paul, you acked this previously. Should I add you acked-by line so
people calm down?  If you would rather I replace
rcu_dereference/rcu_assign_pointer with rmb/wmb, I can do this.
Or maybe patch Documentation to explain this RCU usage?

-- 
MST
--

From: Eric Dumazet
Date: Tuesday, November 3, 2009 - 2:11 pm

So you believe I am over-reacting to this dubious use of RCU ?

RCU documentation is already very complex, we dont need to add yet another
subtle use, and makes it less readable.

It seems you use 'RCU api' in drivers/vhost/net.c as convenient macros :

#define rcu_dereference(p)     ({ \
                                typeof(p) _________p1 = ACCESS_ONCE(p); \
                                smp_read_barrier_depends(); \
                                (_________p1); \
                                })

#define rcu_assign_pointer(p, v) \
        ({ \
                if (!__builtin_constant_p(v) || \
                    ((v) != NULL)) \
                        smp_wmb(); \
                (p) = (v); \
        })


There are plenty regular uses of smp_wmb() in kernel, not related to Read Copy Update,
there is nothing wrong to use barriers with appropriate comments.

(And you already use mb(), wmb(), rmb(), smp_wmb() in your patch)


BTW there is at least one locking bug in vhost_net_set_features()

Apparently, mutex_unlock() doesnt trigger a fault if mutex is not locked
by current thread... even with DEBUG_MUTEXES / DEBUG_LOCK_ALLOC


static void vhost_net_set_features(struct vhost_net *n, u64 features)
{
       size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
               sizeof(struct virtio_net_hdr) : 0;
       int i;
<<!>>  mutex_unlock(&n->dev.mutex);
       n->dev.acked_features = features;
       smp_wmb();
       for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
               mutex_lock(&n->vqs[i].mutex);
               n->vqs[i].hdr_size = hdr_size;
               mutex_unlock(&n->vqs[i].mutex);
       }
       mutex_unlock(&n->dev.mutex);
       vhost_net_flush(n);
}
--

From: Michael S. Tsirkin
Date: Wednesday, November 4, 2009 - 5:04 am

Well, what I do has classic RCU characteristics: readers do not take
locks, writers take a lock and flush after update. This is why I believe
rcu_dereference and rcu_assign_pointer are more appropriate here than
open-coding barriers.

Before deciding whether it's a good idea to open-code barriers

Yes, virtio guest pretty much forces this, there's no way to share

Thanks very much for spotting this! Will fix.

-- 
MST
--

From: Paul E. McKenney
Date: Tuesday, November 3, 2009 - 4:57 pm

What was happening in that case was that the rcu_dereference()
was being used in a workqueue item.  The role of rcu_read_lock()
was taken on be the start of execution of the workqueue item, of
rcu_read_unlock() by the end of execution of the workqueue item, and
of synchronize_rcu() by flush_workqueue().  This does work, at least
assuming that flush_workqueue() operates as advertised, which it appears
to at first glance.

The above code looks somewhat different, however -- I don't see
handle_tx() being executed in the context of a work queue.  Instead
it appears to be in an interrupt handler.

So what is the story?  Using synchronize_irq() or some such?

							Thanx, Paul
--

From: Michael S. Tsirkin
Date: Wednesday, November 4, 2009 - 4:57 am

No, there has been no change (I won't be able to use a mutex in an
interrupt handler, will I?).  handle_tx is still called in the context
of a work queue: either from handle_tx_kick or from handle_tx_net which
are work queue items.

Can you ack this usage please?

-- 
MST
--

From: Paul E. McKenney
Date: Wednesday, November 4, 2009 - 10:25 am

Ah, my mistake -- I was looking at 2.6.31 rather than latest git with

I thought I had done so in my paragraph above, but if you would like
something a bit more formal...

	I, Paul E. McKenney, maintainer of the RCU implmentation
	embodied in the Linux kernel and co-inventor of RCU, being of
	sound mind and body, notwithstanding the wear and tear inherent
	in my numerous decades sojourn on this planet, hereby declare
	that the following usage of work queues constitutes a valid
	RCU implementation:

	1.	Execution of a full workqueue item being substituted
		for a conventional RCU read-side critical section, so
		that the start of execution of the function specified to
		INIT_WORK() corresponds to rcu_read_lock(), and the end of
		this self-same function corresponds to rcu_read_unlock().

	2.	Execution of flush_workqueue() being substituted for
		the conventional synchronize_rcu().

	The kernel developer availing himself or herself of this
	declaration must observe the following caveats:

	a.	The function specified to INIT_WORK() may only be
		invoked via the workqueue mechanism.  Invoking said
		function directly renders this declaration null
		and void, as it prevents the flush_workqueue() function
		from delivering the fundamental guarantee inherent in RCU.

	b.	At some point in the future, said developer may be
		required to apply some gcc attribute or sparse annotation
		to the function passed to INIT_WORK().	Beyond that
		point, failure to comply will render this declaration
		null and void, as such failure would render inoperative
		some potential RCU-validation tools, as duly noted by
		Eric Dumazet.

	c.	This declaration in no way relieves the developer of
		the responsibility to use this and other synchronization
		mechanisms correctly, again, as duly noted by Eric
		Dumazet.

(Sorry, but, as always, I could not resist!)

							Thanx, Paul
--

From: Eric Dumazet
Date: Wednesday, November 4, 2009 - 10:33 am

Yesssss :)

Thanks Paul for this masterpiece of diplomatic "Acked-by" ;)



--

From: Michael S. Tsirkin
Date: Wednesday, November 4, 2009 - 12:06 pm

Thanks Paul!
Jonathan: are you reading this?
Another one for your quotes of the week collection :)

-- 
MST
--

From: Gregory Haskins
Date: Wednesday, November 4, 2009 - 12:12 pm

I second that.

From: Rusty Russell
Date: Thursday, November 5, 2009 - 10:01 pm

<snip verbose super-ack with qualifications>

That's great guys.  And yes, this is a kind of read-copy-update.  And no,
there's nothing wrong with it.

But it's still nasty to use half an API.  If it were a few places I would
have open-coded it with a comment, or wrapped it.  As it is, I don't think
that would be a win.

Cheers,
Rusty.
--

From: Paul E. McKenney
Date: Friday, November 6, 2009 - 9:30 am

So would it help to have a rcu_read_lock_workqueue() and
rcu_read_unlock_workqueue() that checked nesting and whether they were
actually running in the context of a workqueue item?  Or did you have
something else in mind?  Or am I misjudging the level of sarcasm in
your reply?  ;-)

							Thanx, Paul
--

From: Rusty Russell
Date: Saturday, November 7, 2009 - 9:09 pm

You read correctly.  If we get a second user, creating an API makes sense.

Thanks,
Rusty.
--

From: Paul E. McKenney
Date: Sunday, November 8, 2009 - 12:36 pm

Makes sense to me as well.  Which does provide some time to come up with
a primitive designed to answer the question "Am I currently executing in
the context of a workqueue item?".  ;-)

							Thanx, Paul
--

From: Michael S. Tsirkin
Date: Tuesday, November 3, 2009 - 12:55 pm

This use is correct. See comment in vhost.h This use of RCU has been
acked by Paul E. McKenney (paulmck@linux.vnet.ibm.com) as well.
There are many ways to use RCU not all of which involve read_rcu_lock.

--

From: Andi Kleen
Date: Wednesday, November 4, 2009 - 4:08 am

"Michael S. Tsirkin" <mst@redhat.com> writes:


I haven't gone over all this code in detail, but that isolated reference count
use looks suspicious. What prevents the mm from going away before
you increment, if it's not the current one?

-Andi 

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Michael S. Tsirkin
Date: Wednesday, November 4, 2009 - 5:10 am

We take a reference to it before we start any virtqueues,
and stop all virtqueues before we drop the reference:
/* Caller should have device mutex */
static long vhost_dev_set_owner(struct vhost_dev *dev)
{
        /* Is there an owner already? */
        if (dev->mm)
                return -EBUSY;
        /* No owner, become one */
        dev->mm = get_task_mm(current);
        return 0;
}

And 
vhost_dev_cleanup:
....

        if (dev->mm)
                mmput(dev->mm);
        dev->mm = NULL;
}


--

From: Andi Kleen
Date: Wednesday, November 4, 2009 - 5:59 am

> Fine?

I cannot say -- are there paths that could drop the device beforehand?
(as in do you hold a reference to it?)

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Michael S. Tsirkin
Date: Wednesday, November 4, 2009 - 6:08 am

By design I think I always have a reference to mm before I use it.

This works like this:
ioctl SET_OWNER - calls get_task_mm, I think this gets a reference to mm
ioctl SET_BACKEND - checks that SET_OWNER was run, starts virtqueue
ioctl RESET_OWNER - stops virtqueues, drops the reference to mm
file close - stops virtqueues, if we still have it then drops mm

This is why I think I can call use_mm/unuse_mm while virtqueue is running,
safely.
--

From: Andi Kleen
Date: Wednesday, November 4, 2009 - 6:15 am

Do you protect against another thread doing RESET_OWNER in parallel while
RESET_OWNER runs?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Michael S. Tsirkin
Date: Wednesday, November 4, 2009 - 6:17 am

The device is created when file is open and destroyed
when file is closed. So I think the fs code handles the
reference counting for me: it won't call file cleanup
callback while some userspace process has the file open.

--

From: Andi Kleen
Date: Wednesday, November 4, 2009 - 6:37 am

Yes.

But the semantics when someone inherits such a fd through exec
or through file descriptor passing would be surely "interesting"
You would still do IO on the old VM.

I guess it would be a good way to confuse memory accounting schemes 
or administrators @)

It would be all saner if this was all a single atomic step.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Michael S. Tsirkin
Date: Wednesday, November 4, 2009 - 6:41 am

I have this atomic actually. A child process will first thing
do SET_OWNER: this is required before any other operation.

SET_OWNER atomically (under mutex) does two things:
- check that there is no other owner
- get mm and set current process as owner

I hope this addresses your concern?

-- 
MST
--

From: Michael S. Tsirkin
Date: Wednesday, November 4, 2009 - 9:37 am

Andrea, since you looked at this design at the early stages,
maybe you can provide feedback on the following question:

vhost has an ioctl to do get_task_mm and store the mm in per-file device
structure.  mmput is called when file is closed.  vhost is careful not
to reference the mm after is has been put. There is also an atomic
mutual exclusion mechanism to ensure that vhost does not allow one
process to access another's mm, even if they share a vhost file
descriptor.  But, this still means that mm structure can outlive the
task if the file descriptor is shared with another process.

Other drivers, such as kvm, have the same property.

Do you think this is OK?

Thanks!

-- 
MST
--

Previous thread: [PATCHv7 1/3] tun: export underlying socket by Michael S. Tsirkin on Tuesday, November 3, 2009 - 10:24 am. (2 messages)

Next thread: 2.6.32-rc5-mmotm1101 - kernel BUG at net/ipv4/tcp_input.c:3707! by Valdis.Kletnieks on Tuesday, November 3, 2009 - 10:50 am. (17 messages)