Re: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.

Previous thread: [RFC PATCH v8 02/16] Add a new struct for device to manipulate external buffer. by xiaohui.xin on Thursday, July 29, 2010 - 4:14 am. (14 messages)

Next thread: [PATCH] ecryptfs (repost): release reference to lower mount if interpose fails by Lino Sanfilippo on Thursday, July 29, 2010 - 4:01 am. (2 messages)
From: xiaohui.xin
Date: Thursday, July 29, 2010 - 4:14 am

We provide a zero-copy method by which the driver side may get external
buffers to DMA into. Here, external means the driver does not use kernel
space to allocate skb buffers. Currently the external buffers can come
from the guest virtio-net driver.

The idea is simple: pin the guest VM user space and then let the host
NIC driver DMA to it directly.
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops (sendmsg/recvmsg) to vhost-net to
send/recv directly to/from the NIC driver. A KVM guest that uses the
vhost-net backend may bind any ethX interface on the host side to
get copyless data transfer through the guest virtio-net frontend.

patch 01-10:	net core and kernel changes.
patch 11-13:	new device as interface to manipulate external buffers.
patch 14:	for vhost-net.
patch 15:	An example of modifying a NIC driver to use napi_gro_frags().
patch 16:	An example of how to get guest buffers based on a driver
		that uses napi_gro_frags().

The guest virtio-net driver submits multiple requests through the
vhost-net backend driver to the kernel. The requests are queued and then
completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked. This means NICs can allocate user
buffers from a page constructor. We add a hook in the netif_receive_skb()
function to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device allocates a new host skb, puts the
payload on skb_shinfo(skb)->frags, and copies the header to skb->data.
The request remains pending until the skb is transmitted by h/w.
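
The write path described above can be illustrated with a small user-space
model. This is only a sketch under stated assumptions, not the code from
the patches: the sketch_* names, the 4096-byte page size and the frag limit
are all hypothetical. It copies the header into a linear area and describes
the payload as per-page (page address, offset, size) fragments, without
copying the payload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration only */
#define SKETCH_MAX_FRAGS 18	/* hypothetical cap on fragments per packet */
#define SKETCH_HDR_MAX   128	/* hypothetical size of the linear header area */

/* One payload fragment: which page it lives in, and where inside that page. */
struct sketch_frag {
	uintptr_t page_addr;	/* address of the start of the containing page */
	uint32_t  offset;	/* offset of the data inside the page */
	uint32_t  size;		/* number of payload bytes in this page */
};

/* User-space stand-in for an skb: copied header plus referenced payload. */
struct sketch_skb {
	uint8_t  hdr[SKETCH_HDR_MAX];
	uint32_t hdr_len;
	struct sketch_frag frags[SKETCH_MAX_FRAGS];
	uint32_t nr_frags;
};

/*
 * Copy the first hdr_len bytes of buf into the linear area and describe the
 * rest of buf as page-sized fragments. Returns 0 on success, -1 if the
 * header or the fragment list does not fit.
 */
static int sketch_build_tx(struct sketch_skb *skb, const uint8_t *buf,
			   size_t len, size_t hdr_len)
{
	const uint8_t *p;
	size_t left;

	assert(skb != NULL && buf != NULL);
	if (hdr_len > len || hdr_len > SKETCH_HDR_MAX)
		return -1;

	memcpy(skb->hdr, buf, hdr_len);		/* the header is copied */
	skb->hdr_len = (uint32_t)hdr_len;
	skb->nr_frags = 0;

	p = buf + hdr_len;
	left = len - hdr_len;
	while (left > 0) {			/* the payload is only referenced */
		uintptr_t addr = (uintptr_t)p;
		uint32_t off = (uint32_t)(addr & (SKETCH_PAGE_SIZE - 1));
		size_t chunk = SKETCH_PAGE_SIZE - off;
		struct sketch_frag *f;

		if (chunk > left)
			chunk = left;
		if (skb->nr_frags == SKETCH_MAX_FRAGS)
			return -1;
		f = &skb->frags[skb->nr_frags++];
		f->page_addr = addr - off;
		f->offset = off;
		f->size = (uint32_t)chunk;
		p += chunk;
		left -= chunk;
	}
	return 0;
}
```

With a page-aligned buffer, a 5000-byte packet starting at offset 100 with a
64-byte header yields two fragments: 3932 bytes at offset 164 of the first
page and 1004 bytes at offset 0 of the second.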

We provide multiple submits and asynchronous notification to
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later.

What we have not done yet:
	Performance tuning

What we have done in v1:
	polish the RCU usage
	deal with write logging in asynchronous mode in vhost
	add ...
From: xiaohui.xin
Date: Thursday, July 29, 2010 - 4:14 am

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/skbuff.h |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..74af06c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,15 @@ struct skb_shared_info {
 	void *		destructor_arg;
 };
 
+/* The structure is for a skb which pages may point to
+ * an external buffer, which is not allocated from kernel space.
+ * It also contains a destructor for itself.
+ */
+struct skb_ext_page {
+	struct		page *page;
+	void		(*dtor)(struct skb_ext_page *);
+};
+
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of
-- 
1.5.4.4
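
As a rough illustration of how a structure like skb_ext_page might be used,
here is a user-space sketch. Only the struct layout comes from the patch
above; the owner struct, the container_of-style macro and the release helper
are hypothetical names introduced for this example. The owner embeds the
skb_ext_page, and the destructor recovers the owner when the page is released.

```c
#include <assert.h>
#include <stddef.h>

struct page;	/* opaque here; a real kernel type in the patch */

/* Same layout as in the patch above. */
struct skb_ext_page {
	struct page *page;
	void (*dtor)(struct skb_ext_page *);
};

/* Hypothetical owner of an external buffer, embedding skb_ext_page. */
struct sketch_guest_buf {
	struct skb_ext_page ext;
	int released;		/* counts how often the destructor ran */
};

/* Minimal user-space equivalent of the kernel's container_of(). */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Destructor: map the embedded struct back to its owner and mark it done. */
static void sketch_guest_buf_dtor(struct skb_ext_page *ext)
{
	struct sketch_guest_buf *gb =
		sketch_container_of(ext, struct sketch_guest_buf, ext);

	gb->released++;
}

/* Hypothetical release helper: what a free path might do for such a page. */
static void sketch_release_ext_page(struct skb_ext_page *ext)
{
	assert(ext != NULL);
	if (ext->dtor)
		ext->dtor(ext);
}
```

The point of the callback is that the code releasing the page does not need
to know who owns the external buffer; the owner is notified through dtor.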

--

From: Shirley Ma
Date: Thursday, July 29, 2010 - 3:31 pm

Hello Xiaohui,


Since vhost-net already supports macvtap/tun backends, do you think
it would be better to implement zero copy in macvtap/tun than inducing

I did some vhost performance measurements over 10Gb ixgbe, and found
that in order to get consistent BW results, SMP affinities must be set
for the netperf/netserver, qemu, and vhost threads.

Looking forward to these results for small message size comparison. For
large message sizes, 10Gb ixgbe BW is already reached by setting vhost
SMP affinity with offloading support; we will see how much CPU
utilization can be reduced.

Please provide latency results as well. I did some experiments on
macvtap zero-copy sendmsg, and found that get_user_pages latency is
pretty high.

Thanks
Shirley




--

From: Avi Kivity
Date: Thursday, July 29, 2010 - 10:02 pm

get_user_pages() is indeed slow.  But what about get_user_pages_fast()?

Note that when the page is first touched, get_user_pages_fast() falls 
back to get_user_pages(), so the latency needs to be measured after 
quite a bit of warm-up.
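
A user-space analogue of the warm-up point, as a hedged sketch: touch every
page of a buffer before any timing starts, so that first-touch faults are not
counted in the measurement. The sketch_* helpers are hypothetical; mincore()
is used only to check residency afterwards and may be unavailable in some
sandboxes, in which case the helper returns -1.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write one byte per page so that every page is faulted in before the
 * timed part of a benchmark starts. Returns the number of pages touched.
 */
static size_t sketch_warm_up(volatile unsigned char *buf, size_t len,
			     size_t page_size)
{
	size_t off, touched = 0;

	assert(buf != NULL && page_size > 0);
	for (off = 0; off < len; off += page_size) {
		buf[off] = 1;	/* a write forces a private page to be allocated */
		touched++;
	}
	return touched;
}

/*
 * Count how many pages of a page-aligned mapping are resident according
 * to mincore(). Returns -1 on error or if the range is too large for
 * this sketch.
 */
static long sketch_resident_pages(void *buf, size_t len, size_t page_size)
{
	unsigned char vec[256];
	size_t i, npages = (len + page_size - 1) / page_size;
	long resident = 0;

	if (npages > sizeof(vec))
		return -1;
	if (mincore(buf, len, vec) != 0)
		return -1;
	for (i = 0; i < npages; i++)
		resident += vec[i] & 1;	/* low bit set means resident */
	return resident;
}
```

A latency benchmark would call sketch_warm_up() on its buffers once, and only
then start the clock.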

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Shirley Ma
Date: Friday, July 30, 2010 - 8:46 am

Hello Avi,


Yes, I used get_user_pages_fast; however, it fell back to
get_user_pages() when the app doesn't allocate buffers on the same page.
If I run a single ping, the RTT is extremely high. When running
multiple pings, the RTT drops significantly, but it is still not
as fast as copy in my initial test. I am thinking that we might need
to pre-pin a memory pool.

Shirley

--

From: Avi Kivity
Date: Sunday, August 1, 2010 - 1:18 am

I don't understand.  Under what conditions do you use get_user_pages() 
instead of get_user_pages_fast()?  Why?

-- 
error compiling committee.c: too many arguments to function

--

From: Shirley Ma
Date: Monday, August 2, 2010 - 9:01 am

Hello Avi,


The code always calls get_user_pages_fast; however, the page will be
unpinned in skb_free if the same page is not used again for a new
buffer. The reason for unpinning the page is that we don't want to pin
all of the guest kernel memory (memory overcommit). So
get_user_pages_fast will call the slow path, get_user_pages.

Is your previous comment suggesting that we keep the page pinned for
the get_user_pages_fast fast path?

Thanks
Shirley

--

From: Avi Kivity
Date: Monday, August 2, 2010 - 9:11 am

I don't understand this. gup_fast() only calls gup() if the page is 

Right now I'm not sure I understand what's happening.

-- 
error compiling committee.c: too many arguments to function

--

From: Shirley Ma
Date: Monday, August 2, 2010 - 9:25 am

Oh, I used the page as read-only on the xmit path. Should I use write
instead?

Thanks
Shirley

--

From: Avi Kivity
Date: Monday, August 2, 2010 - 9:32 am

No, for xmit getting the page as read only is fine.

I was inaccurate, gup_fast() performs as follows:

- if .write = 1, gup_fast() will be fast if the page is mapped and writeable
- if .write = 0, gup_fast() will be fast if the page is mapped

so, using .write = 0 for the xmit path will be faster in more cases than 
.write = 1.

When are you seeing gup_fast() fall back to gup()?  It should be at most 
once per page (when a guest starts up none of its pages are mapped, it 
faults them in on demand).

-- 
error compiling committee.c: too many arguments to function

--

From: Shirley Ma
Date: Tuesday, August 10, 2010 - 8:28 pm

Hello Avi,


netperf/netserver latency results are pretty good for message sizes
between 1 byte and 512 bytes when I have a 64-byte small copy.

However, if I don't have any small copy, the ping RTT is
unreasonably high. Since we think it's better to copy small messages,
there will be no issue.

Thanks
Shirley

--

From: Xin, Xiaohui
Date: Friday, July 30, 2010 - 1:53 am

From: Shirley Ma
Date: Friday, July 30, 2010 - 8:51 am

Hello Xiaohui,


I think there would be less duplicated code in the kernel if we use
macvtap to support what the media passthrough driver does here. Since
macvtap already supports the virtio_net header and offloading, the only
missing function is zero copy. Also, QEMU supports macvtap; we just need
to add a zero-copy flag as an option.

Thanks
Shirley

--

From: Arnd Bergmann
Date: Saturday, July 31, 2010 - 2:30 am

Yes, I fully agree and that was one of the intended directions for
macvtap to start with. Thank you so much for following up on that,
I've long been planning to work on macvtap zero-copy myself but it's
now lower on my priorities, so it's good to hear that you made progress
on it, even if there are still performance issues.

	Arnd
--

From: Dong, Eddie
Date: Tuesday, August 3, 2010 - 7:06 pm

But zero-copy is a generic Linux feature that can be used by other VMMs as well, if the BE service drivers want to incorporate it.  If we can make the mp device VMM-agnostic (it may not be yet in the current patch), that will help Linux more.


Thx, Eddie
--

From: Arnd Bergmann
Date: Wednesday, August 4, 2010 - 1:56 am

But the tun/tap protocol is what most hypervisors use today on Linux,
and one of the design goals of macvtap was to keep that interface
so that everyone gets the features like zero-copy if that is added
to macvtap. The mp device interface is currently not supported by
anything else than vhost with these patches, and making it more
generic would turn the interface into a copy of macvtap.

	Arnd
--

From: Shirley Ma
Date: Wednesday, August 4, 2010 - 10:09 am

Hello Eddie,


First, other VMMs support tun/tap, which provides most functions but not
zero copy.

Second, the mp patch only supports zero copy for vhost now, whereas
macvtap zero copy would not be limited to vhost.

Third, the current mp device doesn't fall back to copy on failure.

So you can extend the mp device to support all functions, but the usage
and functions will be similar to macvtap. Then either the mp device will
replace macvtap in the future with similar functions, or we can enhance
macvtap to support zero copy. It is not necessary to have both in Linux.

I think it's better to implement zero copy in macvtap and tun/tap instead
of creating a new mp device.

Thanks
Shirley

--

From: Michael S. Tsirkin
Date: Sunday, August 1, 2010 - 1:31 am

Could you provide an example of a good setup?
Specifically, is it a good idea for the vhost thread

I think we should explore the idea for the driver to fall back on data copy
for small message sizes.
--

From: Shirley Ma
Date: Monday, August 2, 2010 - 9:04 am

Hello Michael,


Yes, we used to have 128 bytes for small copy in another driver. I saw
that Xiaohui's patch here uses 64 bytes. I think we need to compare the
performance on different platforms to decide what's best for small
message sizes.
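
The small-copy cutoff under discussion amounts to a length threshold. The
sketch below is a hedged illustration with hypothetical names; it is not
taken from either driver. It picks copy or zero copy per message and counts
how many messages of a given size mix would go zero-copy, which is one way
to compare a 64-byte cutoff against a 128-byte one.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_tx_path { SKETCH_TX_COPY, SKETCH_TX_ZEROCOPY };

/* Messages at or below the threshold are copied; larger ones go zero-copy. */
static enum sketch_tx_path sketch_pick_tx_path(size_t len,
					       size_t copy_threshold)
{
	return len <= copy_threshold ? SKETCH_TX_COPY : SKETCH_TX_ZEROCOPY;
}

/* Count how many of the given message sizes would take the zero-copy path. */
static size_t sketch_count_zerocopy(const size_t *lens, size_t n,
				    size_t copy_threshold)
{
	size_t i, count = 0;

	assert(lens != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (sketch_pick_tx_path(lens[i], copy_threshold) ==
		    SKETCH_TX_ZEROCOPY)
			count++;
	return count;
}
```

For the sizes 1, 64, 65, 128, 129 and 512 bytes, a 64-byte threshold sends
four messages zero-copy and a 128-byte threshold sends two.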

Thanks
Shirley

--

From: Shirley Ma
Date: Monday, August 2, 2010 - 9:10 am

Hello Michael,


I need to retest my setup with multi-threaded vhost. My previous setup
applies to single-threaded vhost. For the single-stream netperf/netserver
setup, for example, with two quad-core sockets, this layout gets a
consistent 9.4Gb/s BW:

socket 1:
cpu0: netperf/netserver
cpu1: ixgbe 10GbE NIC IRQ 
cpu2: I/O thread
cpu3: vhost thread

socket 2:
cpu0: QEMU VCPU0
cpu1: QEMU VCPU1
cpu2:
cpu3:
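
For reference, pinning a thread to one CPU from inside a program can be
sketched with sched_setaffinity(), as below. This is only an illustration
of the mechanism with hypothetical helper names; it says nothing about how
the affinities in the layout above were actually applied.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Pin the calling thread to a single CPU.
 * Returns 0 on success, -1 on error (errno is set by sched_setaffinity).
 */
static int sketch_pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);	/* pid 0 = caller */
}

/* Return the lowest-numbered CPU the caller may run on, or -1 on error. */
static int sketch_first_allowed_cpu(void)
{
	cpu_set_t set;
	int cpu;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}
```

Each benchmark component (netperf, the I/O thread, the vhost thread, the
VCPU threads) would be pinned to its own CPU in this way, or from the
outside with a tool such as taskset.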

Thanks
Shirley

--

From: Xin, Xiaohui
Date: Tuesday, August 3, 2010 - 1:48 am

Could you share your performance results (including BW and latency) on
vhost-net and how you got them (your configuration, especially the
affinity settings)?

Thanks

--

From: Shirley Ma
Date: Tuesday, August 3, 2010 - 8:50 am

Hello Xiaohui,


My macvtap zero copy is incomplete; I am testing sendmsg only for now.
The initial performance is not good, especially for latency (zero copy
vs. copy). I am still working on it to find out why and how to improve.
That's the reason I am eager to know your performance results and how
much performance gain you have seen.

Since your patch is complete, I would like to try it here for
performance. If you have some performance results to share here, that
would be great.

Thanks
Shirley

--

From: Xin, Xiaohui
Date: Thursday, August 5, 2010 - 1:52 am

Herbert,
The v8 patches are modified mostly based on your comments about the
napi_gro_frags interface. What do you think of the patches for the
net core part?
We know there are currently some comments about the mp device,
such as supporting zero-copy for tun/tap and macvtap. Since there
isn't a decision about that yet, could you comment on the net core
part first, since this part is the same for zero-copy either way?

Thanks
--
