We provide a zero-copy method by which the driver side can get external buffers to DMA into. Here "external" means the driver does not use kernel space to allocate skb buffers. Currently the external buffers can come from the guest virtio-net driver.

The idea is simple: pin the guest VM user space and then give the host NIC driver the chance to DMA directly to it. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops (sendmsg/recvmsg) to vhost-net to send/recv directly to/from the NIC driver. A KVM guest that uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend.

patch 01-10: net core and kernel changes.
patch 11-13: new device as interface to manipulate external buffers.
patch 14: for vhost-net.
patch 15: An example of modifying a NIC driver to use napi_gro_frags().
patch 16: An example of how to get guest buffers based on a driver that uses napi_gro_frags().

The guest virtio-net driver submits multiple requests through the vhost-net backend driver to the kernel. The requests are queued and then completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked. This means NICs can allocate user buffers from a page constructor. We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w.

We provide multiple submits and asynchronous notification to vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later.

What we have not done yet:
	Performance tuning

What we have done in v1:
	polish the RCU usage
	deal with write logging in asynchronous mode in vhost
	add ...
From: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
include/linux/skbuff.h | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..74af06c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,15 @@ struct skb_shared_info {
void * destructor_arg;
};
+/* The structure is for a skb which pages may point to
+ * an external buffer, which is not allocated from kernel space.
+ * It also contains a destructor for itself.
+ */
+struct skb_ext_page {
+ struct page *page;
+ void (*dtor)(struct skb_ext_page *);
+};
+
/* We divide dataref into two halves. The higher 16 bits hold references
* to the payload part of skb->data. The lower 16 bits hold references to
* the entire skb->data. A clone of a headerless skb holds the length of
--
1.5.4.4
--
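The patch above only adds the structure; how a provider of external buffers would use it is not shown in this message. As a rough userspace sketch (not the code from the series), an owner could embed struct skb_ext_page in its own bookkeeping and recover it in the destructor. Here struct page is a stand-in type, and guest_page_info, guest_page_init(), guest_page_dtor() and ext_page_put() are hypothetical names introduced only for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in; in the kernel this would be the real struct page. */
struct page { int id; };

/* Same members as the structure added by the patch. */
struct skb_ext_page {
	struct page *page;
	void (*dtor)(struct skb_ext_page *);
};

/* Hypothetical owner-side bookkeeping that wraps one external page. */
struct guest_page_info {
	struct skb_ext_page ext_page;	/* embedded so the dtor can find us */
	int released;			/* set once the page is handed back */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Destructor: recover the owner and mark the buffer as returned. */
static void guest_page_dtor(struct skb_ext_page *ext)
{
	struct guest_page_info *info =
		container_of(ext, struct guest_page_info, ext_page);

	info->released = 1;
}

/* Attach a page to the owner's bookkeeping and install the destructor. */
static void guest_page_init(struct guest_page_info *info, struct page *pg)
{
	assert(info && pg);
	info->ext_page.page = pg;
	info->ext_page.dtor = guest_page_dtor;
	info->released = 0;
}

/* What a consumer might call once it is finished with the page. */
static void ext_page_put(struct skb_ext_page *ext)
{
	if (ext && ext->dtor)
		ext->dtor(ext);
}
```

The point of carrying the destructor alongside the page is that the consumer does not need to know who owns the buffer; it only calls back through the pointer.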
Hello Xiaohui, Since vhost-net already supports macvtap/tun backends, do you think it would be better to implement zero copy in macvtap/tun than inducing

I did some vhost performance measurements over 10Gb ixgbe and found that, in order to get consistent BW results, smp affinities are required for the netperf/netserver, qemu, and vhost threads. Looking forward to these results for small message size comparison. For large message sizes, 10Gb ixgbe BW is already reached by setting vhost smp affinity w/i offloading support; we will see how much the CPU utilization can be reduced. Please provide latency results as well. I did some experiments on macvtap zero copy sendmsg, and what I have found is that get_user_pages latency is pretty high. Thanks Shirley --
get_user_pages() is indeed slow. But what about get_user_pages_fast()? Note that when the page is first touched, get_user_pages_fast() falls back to get_user_pages(), so the latency needs to be measured after quite a bit of warm-up. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
Hello Avi, Yes, I used get_user_pages_fast; however, it fell back to get_user_pages() when the app doesn't allocate buffers on the same page. If I run a single ping, the RTT is extremely high; when running multiple pings, the RTT drops significantly, but it is still not as fast as copy in my initial test. I am thinking that we might need to pre-pin a memory pool. Shirley --
I don't understand. Under what conditions do you use get_user_pages() instead of get_user_pages_fast()? Why? -- error compiling committee.c: too many arguments to function --
Hello Avi, The code always calls get_user_pages_fast; however, the page will be unpinned in skb_free if the same page is not used again for a new buffer. The reason for unpinning the page is that we don't want to pin all of the guest kernel memory (memory overcommit). So get_user_pages_fast will call the slow path, get_user_pages. Is your previous comment suggesting to keep the page pinned for the get_user_pages_fast fast path? Thanks Shirley --
I don't understand this. gup_fast() only calls gup() if the page is

Right now I'm not sure I understand what's happening. -- error compiling committee.c: too many arguments to function --
Oh, I used the page as read-only on xmit path. Should I use write instead? Thanks Shirley --
No, for xmit getting the page as read only is fine. I was inaccurate, gup_fast() performs as follows:
- if .write = 1, gup_fast() will be fast if the page is mapped and writeable
- if .write = 0, gup_fast() will be fast if the page is mapped
so, using .write = 0 for the xmit path will be faster in more cases than .write = 1. When are you seeing gup_fast() fall back to gup()? It should be at most once per page (when a guest starts up none of its pages are mapped, it faults them in on demand). -- error compiling committee.c: too many arguments to function --
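The rule described above can be written down as a tiny userspace model. This is only a sketch of the stated condition, not kernel code: struct pte_model and gup_fast_path_ok() are hypothetical names, and the real fast path checks more than these two bits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of one page-table entry. */
struct pte_model {
	bool mapped;	/* page is present in the page tables */
	bool writable;	/* mapping allows writes */
};

/*
 * Returns true if the fast path could succeed under the rule above:
 * write = true  needs the page mapped and writable,
 * write = false only needs the page mapped.
 */
static bool gup_fast_path_ok(const struct pte_model *pte, bool write)
{
	assert(pte);
	if (!pte->mapped)
		return false;
	if (write && !pte->writable)
		return false;
	return true;
}
```

For a page that is mapped read-only, the model reports the fast path as possible for a read request and not for a write request, which is why a read-only pin on the xmit path hits the fast path in more cases.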
Hello Avi, netperf/netserver latency results are pretty good for message sizes between 1 byte and 512 bytes when I have a 64-byte small copy. However, if I don't have any small copy, the ping RTT is unreasonably high. Since we think it's better to handle small messages with copy anyway, there will be no issue. Thanks Shirley --
Hello Xiaohui, I think there would be less duplicated code in the kernel if we used macvtap to support what the media passthrough driver does here. Since macvtap already supports the virtio_net header and offloading, the only missing function is zero copy. Also, QEMU supports macvtap; we would just need to add a zero copy flag to the options. Thanks Shirley --
Yes, I fully agree and that was one of the intended directions for macvtap to start with. Thank you so much for following up on that, I've long been planning to work on macvtap zero-copy myself but it's now lower on my priorities, so it's good to hear that you made progress on it, even if there are still performance issues. Arnd --
But zero-copy is a generic Linux feature that can be used by other VMMs as well if the BE service drivers want to incorporate it. If we can make the mp device VMM-agnostic (it may not be yet in the current patch), that will help Linux more. Thx, Eddie --
But the tun/tap protocol is what most hypervisors use today on Linux, and one of the design goals of macvtap was to keep that interface so that everyone gets features like zero-copy if that is added to macvtap. The mp device interface is currently not supported by anything other than vhost with these patches, and making it more generic would turn the interface into a copy of macvtap. Arnd --
Hello Eddie, First, other VMMs support tun/tap, which provides most functions but not zero copy. Second, the mp patch only supports zero copy for vhost now, whereas macvtap zero copy will not be used by vhost only. Third, the current mp device doesn't fall back to copy on failure. So you can extend the mp device to support all functions, but the usage/functions will be similar to macvtap; then either the mp device will replace macvtap in the future with similar functions, or we can enhance macvtap to support zero copy. It is not necessary to have both in Linux. I think it's better to implement zero copy in macvtap, tun/tap instead of creating a new mp device. Thanks Shirley --
Could you provide an example of a good setup? Specifically, is it a good idea for the vhost thread

I think we should explore the idea of the driver falling back to data copy for small message sizes. --
Hello Michael, Yes, we used to have 128 bytes for small copy in another driver. I saw Xiaohui's patch here is using 64 bytes. I think we need to compare the performance on different platforms to decide what's best for small message sizes. Thanks Shirley --
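To make the small-copy idea concrete, here is a minimal sketch of splitting one outgoing packet into a copied part and a zero-copy part. It is an illustration under assumptions, not the logic of either driver: SMALL_COPY_THRESHOLD, struct xmit_split and split_xmit() are hypothetical names, and the default of 64 is taken from the figure mentioned above (128 being the other value discussed).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed default; the thread mentions both 64 and 128 bytes. */
#define SMALL_COPY_THRESHOLD 64

/* Hypothetical result: how many bytes go where for one packet. */
struct xmit_split {
	size_t copy_len;	/* bytes copied into the linear area */
	size_t frag_len;	/* bytes left in pinned pages (zero-copy) */
};

/*
 * Copy up to 'threshold' bytes; anything beyond that stays in the
 * pinned pages. A packet no longer than the threshold is copied whole,
 * so no page needs to be pinned for it.
 */
static struct xmit_split split_xmit(size_t len, size_t threshold)
{
	struct xmit_split s;

	assert(threshold > 0);
	s.copy_len = len < threshold ? len : threshold;
	s.frag_len = len - s.copy_len;
	return s;
}
```

Keeping the threshold a parameter makes it easy to compare 64 against 128 on different platforms, as suggested above.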
Hello Michael, I need to retest my setup with multi-threaded vhost. My previous setup applies to single-threaded vhost. For the single-stream netperf/netserver setup, for example, if we have two quad-core sockets, this is the layout to get a consistent 9.4Gb/s BW:

socket 1:
	cpu0: netperf/netserver
	cpu1: ixgbe 10GbE NIC IRQ
	cpu2: I/O thread
	cpu3: vhost thread
socket 2:
	cpu0: QEMU VCPU0
	cpu1: QEMU VCPU1
	cpu2:
	cpu3:

Thanks Shirley --
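The message does not say which tool was used to set these affinities. As one possible illustration, a thread can pin itself to a single CPU with the Linux sched_setaffinity() call; pin_self_to_cpu() and first_allowed_cpu() below are hypothetical helper names, and pinning an already running qemu or vhost thread from outside would need its thread ID instead of 0.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for CPU_SET() and sched_setaffinity() in glibc */
#endif
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to one CPU; returns 0 on success, -1 on error. */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}

/* First CPU the calling thread is currently allowed to run on, or -1. */
static int first_allowed_cpu(void)
{
	cpu_set_t set;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}
```

IRQ affinity for the NIC is a separate setting and is not covered by this call.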
Could you share with me your performance results (including BW and latency) on vhost-net and how you got them (your configuration, especially the affinity settings)? Thanks --
Hello Xiaohui, My macvtap zero copy is incomplete; I am testing sendmsg only for now. The initial performance is not good, especially for latency (zero copy vs. copy). I am still working on it to find out why and how to improve it. That's the reason I am eager to know your performance results and how much performance gain you have seen. Since your patch is complete, I would like to try it here for performance. If you have some performance results to share here, that would be great. Thanks Shirley --
Herbert, The v8 patches are modified mostly based on your comments about the napi_gro_frags interface. What do you think about the net core part of the patches? We know there are currently some comments about the mp device, such as supporting zero-copy for tun/tap and macvtap. Since there isn't a decision about that yet, could you comment on the net core part first, since this part is the same for zero-copy either way? Thanks --
