Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Michael S. Tsirkin
Subject: Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
Date: Monday, April 19, 2010 - 3:21 am
On Mon, Apr 19, 2010 at 06:05:17PM +0800, Xin, Xiaohui wrote:
>
> Michael,
> > >>> The idea is simple, just to pin the guest VM user space and then
> > >>> let host NIC driver has the chance to directly DMA to it.
> > >>> The patches are based on vhost-net backend driver. We add a device
> > >>> which provides proto_ops as sendmsg/recvmsg to vhost-net to
> > >>> send/recv directly to/from the NIC driver. KVM guest who use the
> > >>> vhost-net backend may bind any ethX interface in the host side to
> > >>> get copyless data transfer thru guest virtio-net frontend.
> > >>>
> > >>> The scenario is like this:
> > >>>
> > >>> The guest virtio-net driver submits multiple requests thru vhost-net
> > >>> backend driver to the kernel. And the requests are queued and then
> > >>> completed after corresponding actions in h/w are done.
> > >>>
> > >>> For read, user space buffers are dispensed to NIC driver for rx when
> > >>> a page constructor API is invoked. Means NICs can allocate user buffers
> > >>> from a page constructor. We add a hook in netif_receive_skb() function
> > >>> to intercept the incoming packets, and notify the zero-copy device.
> > >>>
> > >>> For write, the zero-copy device may allocates a new host skb and puts
> > >>> payload on the skb_shinfo(skb)->frags, and copied the header to skb->data.
> > >>> The request remains pending until the skb is transmitted by h/w.
> > >>>
> > >>> Here, we have ever considered 2 ways to utilize the page constructor
> > >>> API to dispense the user buffers.
> > >>>
> > >>> One: Modify __alloc_skb() function a bit, it can only allocate a
> > >>> structure of sk_buff, and the data pointer is pointing to a
> > >>> user buffer which is coming from a page constructor API.
> > >>> Then the shinfo of the skb is also from guest.
> > >>> When packet is received from hardware, the skb->data is filled
> > >>> directly by h/w. What we have done is in this way.
> > >>>
> > >>> Pros: We can avoid any copy here.
> > >>> Cons: Guest virtio-net driver needs to allocate skb as almost
> > >>> the same method with the host NIC drivers, say the size
> > >>> of netdev_alloc_skb() and the same reserved space in the
> > >>> head of skb. Many NIC drivers are the same with guest and
> > >>> ok for this. But some latest NIC drivers reserves special
> > >>> room in skb head. To deal with it, we suggest to provide
> > >>> a method in guest virtio-net driver to ask for parameter
> > >>> we interest from the NIC driver when we know which device
> > >>> we have bind to do zero-copy. Then we ask guest to do so.
> > >>> Is that reasonable?
> > >>Unfortunately, this would break compatibility with existing virtio.
> > >>This also complicates migration.
> >> You mean any modification to the guest virtio-net driver will break the
> >> compatibility? We tried to enlarge the virtio_net_config to contains the
> >> 2 parameter, and add one VIRTIO_NET_F_PASSTHRU flag, virtionet_probe()
> >> will check the feature flag, and get the parameters, then virtio-net driver use
> >> it to allocate buffers. How about this?
>
> >This means that we can't, for example, live-migrate between different systems
> >without flushing outstanding buffers.
>
> Ok. What we have thought about now is to do something with skb_reserve().
> If the device is binded by mp, then skb_reserve() will do nothing with it.
>
> > >>What is the room in skb head used for?
> > >I'm not sure, but the latest ixgbe driver does this, it reserves 32 bytes compared to
> >> NET_IP_ALIGN.
>
> >Looking at code, this seems to do with alignment - could just be
> >a performance optimization.
>
> > >>> Two: Modify driver to get user buffer allocated from a page constructor
> > >>> API(to substitute alloc_page()), the user buffer are used as payload
> > >>> buffers and filled by h/w directly when packet is received. Driver
> > >>> should associate the pages with skb (skb_shinfo(skb)->frags). For
> > >>> the head buffer side, let host allocates skb, and h/w fills it.
> > >>> After that, the data filled in host skb header will be copied into
> > >>> guest header buffer which is submitted together with the payload buffer.
> > >>>
> > >>> Pros: We could less care the way how guest or host allocates their
> > >>> buffers.
> > >>> Cons: We still need a bit copy here for the skb header.
> > >>>
> > >>> We are not sure which way is the better here.
> > >>The obvious question would be whether you see any speed difference
> > >>with the two approaches. If no, then the second approach would be
> > >>better.
>
> >> I remember the second approach is a bit slower in 1500MTU.
> >> But we did not tested too much.
>
> >Well, that's an important datapoint. By the way, you'll need
> >header copy to activate LRO in host, so that's a good
> >reason to go with option 2 as well.
>
> > >>> This is the first thing we want
> > >>> to get comments from the community. We wish the modification to the network
> > >>> part will be generic which not used by vhost-net backend only, but a user
> > >>> application may use it as well when the zero-copy device may provides async
> > >>> read/write operations later.
> > >>>
> > >>> Please give comments especially for the network part modifications.
> > >>>
> > >>>
> > >>> We provide multiple submits and asynchronous notification to
> > >>> vhost-net too.
> > >>>
> > >>> Our goal is to improve the bandwidth and reduce the CPU usage.
> > >>> Exact performance data will be provided later. But for simple
> > >>> test with netperf, we found bandwidth up and CPU % up too,
> > >>> but the bandwidth up ratio is much more than CPU % up ratio.
> > >>>
> > >>> What we have not done yet:
> > >>> packet split support
>
> > >>What does this mean, exactly?
> >> We can support 1500MTU, but for jumbo frame, since vhost driver before don't
> > >support mergeable buffer, we cannot try it for multiple sg.
>
> >I do not see why, vhost currently supports 64K buffers with indirect
> >descriptors.
>
> The receive_skb() in guest virtio-net driver will merge the multiple sg to skb frags, how can indirect descriptors do that?

See add_recvbuf_big.
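
To make the pointer above concrete, here is a rough user-space sketch of the kind of buffer layout that add_recvbuf_big sets up in the guest virtio-net driver: one small entry for the virtio-net header, one entry for the rest of the first page, and then one entry per full page. This is an illustration only, not the kernel code. The names (`big_buf`, `fill_big_buf`, `big_buf_payload`), the constants, and the use of malloc() in place of page allocation are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative constants (assumptions, not read from any kernel header). */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_MAX_FRAGS 18u                     /* stand-in for MAX_SKB_FRAGS */
#define SKETCH_NUM_SG    (SKETCH_MAX_FRAGS + 2u) /* header + first page + frags */
#define SKETCH_HDR_SIZE  10u                     /* stand-in for the virtio-net header */
#define SKETCH_HDR_PAD   16u                     /* data starts after a padded header */

/* One scatter-gather entry: a (pointer, length) pair. */
struct sketch_sg {
    unsigned char *addr;
    size_t len;
};

/* A big receive buffer: a table of sg entries plus the pages backing them. */
struct big_buf {
    struct sketch_sg sg[SKETCH_NUM_SG];
    unsigned char *pages[SKETCH_NUM_SG - 1]; /* sg[0] and sg[1] share pages[0] */
};

/* Release every page owned by the buffer. */
static void free_big_buf(struct big_buf *b)
{
    for (size_t i = 0; i < SKETCH_NUM_SG - 1; i++) {
        free(b->pages[i]);
        b->pages[i] = NULL;
    }
}

/* Build the sg table. Returns 0 on success, -1 if a page allocation fails. */
static int fill_big_buf(struct big_buf *b)
{
    assert(b != NULL);
    for (size_t i = 0; i < SKETCH_NUM_SG - 1; i++)
        b->pages[i] = NULL;

    for (size_t i = 0; i < SKETCH_NUM_SG - 1; i++) {
        b->pages[i] = malloc(SKETCH_PAGE_SIZE);
        if (b->pages[i] == NULL) {
            free_big_buf(b);
            return -1;
        }
    }

    /* sg[0]: header only; sg[1]: rest of the same page after the padded header. */
    b->sg[0].addr = b->pages[0];
    b->sg[0].len = SKETCH_HDR_SIZE;
    b->sg[1].addr = b->pages[0] + SKETCH_HDR_PAD;
    b->sg[1].len = SKETCH_PAGE_SIZE - SKETCH_HDR_PAD;

    /* sg[2..]: one full page each. */
    for (size_t i = 2; i < SKETCH_NUM_SG; i++) {
        b->sg[i].addr = b->pages[i - 1];
        b->sg[i].len = SKETCH_PAGE_SIZE;
    }
    return 0;
}

/* Total payload capacity in bytes, excluding the header entry. */
static size_t big_buf_payload(const struct big_buf *b)
{
    size_t total = 0;
    for (size_t i = 1; i < SKETCH_NUM_SG; i++)
        total += b->sg[i].len;
    return total;
}
```

With the assumed values above, big_buf_payload() comes to 77808 bytes, which is enough to hold a 64K packet in a single posted buffer without relying on mergeable buffers.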
> >>> A jumbo frame will split 5
> >>> frags and hook them once a descriptor, so the user buffer allocation is greatly dependent
> >>> on how guest virtio-net drivers submits buffers. We think mergeable buffer is suitable for
> >>> it.
>
> > >> To support GRO
> >>> Actually, I think if the mergeable buffer may get good performance, then GRO is not
> >>> so important then.
>
> >>And TSO/GSO?
> >>> Do we really need them?
>
> >>My guess would be yes. Mergeable buffers is a memory saving
> >>optimization, not a performance optimization, I don't see
> >>that it can help. And I think you can't solely rely on jumbo frames
> >>in hardware, not everyone can enable them.
>
> >Having said that, number one priority is getting decent performance
> >out of the driver, in whatever way you find fit. I was just
> >suggesting obvious ways to do this.
>
> Thanks.
>
> > >> Performance tuning
> > >>
> > >> what we have done in v1:
> > >> polish the RCU usage
> > >> deal with write logging in asynchronous mode in vhost
> > >> add notifier block for mp device
> > >> rename page_ctor to mp_port in netdevice.h to make it looks generic
> > >> add mp_dev_change_flags() for mp device to change NIC state
> > >> add CONFIG_VHOST_MPASSTHRU to limit the usage when module is not load
> > >> a small fix for missing dev_put when fail
> > >> using dynamic minor instead of static minor number
> > >> a __KERNEL__ protect to mp_get_sock()
> > >>
> > >> what we have done in v2:
> > >>
> > >> remove most of the RCU usage, since the ctor pointer is only
> > >> changed by BIND/UNBIND ioctl, and during that time, NIC will be
> > >> stopped to get good cleanup(all outstanding requests are finished),
> > >> so the ctor pointer cannot be raced into wrong situation.
> > >>
> > >> Remove the struct vhost_notifier with struct kiocb.
> > >> Let vhost-net backend to alloc/free the kiocb and transfer them
> > >> via sendmsg/recvmsg.
> > >>
> > >> use get_user_pages_fast() and set_page_dirty_lock() when read.
> > >>
> > >> Add some comments for netdev_mp_port_prep() and handle_mpassthru().
> > >>
> > >>
> > >> Comments not addressed yet in this time:
> > >> the async write logging is not satisfied by vhost-net
> > >> Qemu needs a sync write
> > >> a limit for locked pages from get_user_pages_fast()
> > >>
> > >>
> > >> performance:
> > >> using netperf with GSO/TSO disabled, 10G NIC,
> > >> disabled packet split mode, with raw socket case compared to vhost.
> > >>
> > >> bandwidth will be from 1.1Gbps to 1.7Gbps
> > >> CPU % from 120%-140% to 140%-160%
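
The quoted figures can be put side by side with a little arithmetic. The sketch below computes the bandwidth ratio, the CPU ratio, and the throughput per unit of CPU. Taking the midpoint of each reported CPU range is an assumption made only for illustration, and the helper names (`ratio`, `midpoint`, `gbps_per_cpu`) are hypothetical.

```c
#include <assert.h>

/* Ratio of the new value to the old one; both must be positive. */
static double ratio(double before, double after)
{
    assert(before > 0.0 && after > 0.0);
    return after / before;
}

/* Midpoint of a reported range (an assumption made for illustration). */
static double midpoint(double lo, double hi)
{
    return (lo + hi) / 2.0;
}

/* Throughput per unit of CPU, in Gbps per CPU percent. */
static double gbps_per_cpu(double gbps, double cpu_percent)
{
    assert(cpu_percent > 0.0);
    return gbps / cpu_percent;
}
```

For example, ratio(1.1, 1.7) is about 1.55, while ratio(midpoint(120, 140), midpoint(140, 160)) is about 1.15, which matches the claim that bandwidth rises by a larger factor than CPU usage does.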