The idea is simple: pin the guest VM's user-space buffers and let the host NIC driver DMA directly into them. The patches are based on the vhost-net backend driver. We add a device that provides proto_ops (sendmsg/recvmsg) to vhost-net so that it can send and receive directly to and from the NIC driver. A KVM guest that uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend.

The scenario is as follows. The guest virtio-net driver submits multiple requests through the vhost-net backend driver to the kernel. The requests are queued and then completed after the corresponding actions in hardware are done.

For read, user-space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked. That is, NICs can allocate user buffers from a page constructor. We add a hook in netif_receive_skb() to intercept incoming packets and notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the payload in skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by hardware.

We have considered 2 ways to use the page constructor API to dispense the user buffers.

One: Modify __alloc_skb() slightly so that it allocates only the sk_buff structure, with the data pointer pointing to a user buffer that comes from a page constructor API. The shinfo of the skb then also comes from the guest. When a packet is received from hardware, skb->data is filled directly by hardware. This is the approach we have implemented.

Pros: We can avoid any copy here.
Cons: The guest virtio-net driver needs to allocate skbs in almost the same way as the host NIC drivers, i.e. with the size used by netdev_alloc_skb() and the same reserved space at the head of the skb. Many NIC drivers match the guest here and are fine. But some of the latest NIC drivers reserve special room in the skb head. To deal with it, we suggest to ...
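To make the write path described above a bit more concrete, here is a minimal userspace sketch of it: the header is copied into a linear area (standing in for skb->data), the payload is only referenced through a fragment array (standing in for skb_shinfo(skb)->frags), and the request stays pending until a completion call. This is not code from the patches; every name prefixed with model_ or MODEL_, and both size limits, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_HDR_MAX   128  /* hypothetical cap on copied header bytes */
#define MODEL_MAX_FRAGS 8    /* hypothetical cap on payload fragments */

/* One payload fragment: a reference into a pinned guest buffer, no copy. */
struct model_frag {
    const unsigned char *addr;
    size_t len;
};

/* Userspace stand-in for an skb on the write path. */
struct model_skb {
    unsigned char hdr[MODEL_HDR_MAX];          /* plays the role of skb->data */
    size_t hdr_len;
    struct model_frag frags[MODEL_MAX_FRAGS];  /* plays the role of the frags array */
    size_t nr_frags;
    int pending;  /* 1 until the modelled hardware reports transmit done */
};

/*
 * Build a transmit skb from a guest buffer: copy hdr_len bytes of header,
 * then reference the rest in chunks of at most frag_size bytes.
 * Returns 0 on success, -1 if the packet does not fit the model's limits.
 */
int model_build_tx(struct model_skb *skb, const unsigned char *buf,
                   size_t len, size_t hdr_len, size_t frag_size)
{
    size_t off;

    assert(skb != NULL && buf != NULL);
    if (frag_size == 0 || hdr_len > len || hdr_len > MODEL_HDR_MAX)
        return -1;

    memset(skb, 0, sizeof(*skb));
    memcpy(skb->hdr, buf, hdr_len);  /* the only copy on this path */
    skb->hdr_len = hdr_len;

    for (off = hdr_len; off < len; off += frag_size) {
        size_t chunk = len - off < frag_size ? len - off : frag_size;

        if (skb->nr_frags == MODEL_MAX_FRAGS)
            return -1;
        skb->frags[skb->nr_frags].addr = buf + off;  /* reference, not copy */
        skb->frags[skb->nr_frags].len = chunk;
        skb->nr_frags++;
    }
    skb->pending = 1;
    return 0;
}

/* Called when the modelled hardware has sent the packet. */
void model_tx_done(struct model_skb *skb)
{
    skb->pending = 0;
}
```

The point of the split is that only the header bytes are touched by the CPU; the guest pages holding the payload must stay pinned for as long as pending is set.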
What is the advantage of this approach compared to PCI-passthrough of the host NIC to the guest? Does this require pinning of the entire guest memory? Or only the send/receive buffers? Thanks --
swapping/ksm/etc
independence from host hardware

If done correctly, just the send/receive buffers. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
PCI-passthrough needs hardware support: an IOMMU engine translates guest physical addresses to host physical addresses. Also, a PCI-passthrough device currently cannot be live-migrated. The zero-copy approach is a pure software solution. It doesn't need special hardware support. We only need to pin the send/receive buffers. Thanks --
Unfortunately, this would break compatibility with existing virtio. The obvious question would be whether you see any speed difference between the two approaches. If not, then the second approach would be --
You mean any modification to the guest virtio-net driver will break compatibility? We tried enlarging virtio_net_config to contain the 2 parameters and adding a VIRTIO_NET_F_PASSTHRU flag; virtionet_probe() checks the feature flag and gets the parameters, then the virtio-net driver uses

I'm not sure, but the latest ixgbe driver does this; it reserves 32 bytes compared to

I remember the second approach is a bit slower with a 1500 MTU. We can support a 1500 MTU, but for jumbo frames, since the vhost driver did not previously support mergeable buffers, we cannot try it with multiple sg. A jumbo frame is split into 5 frags, hooked one per descriptor, so the user buffer allocation is greatly dependent

Actually, I think if mergeable buffers get good performance, then GRO is not --
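As a rough illustration of the probe-time check described above, the following userspace sketch reads two extra config parameters only when the host offers the passthru feature bit, and otherwise keeps the guest defaults. The thread names the VIRTIO_NET_F_PASSTHRU flag but not its bit number or the two config fields, so the bit value, the field names (assumed here to be buffer size and reserved headroom) and all model_ identifiers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit number; the thread names the flag but not its value. */
#define MODEL_F_PASSTHRU_BIT 30u

/* Stand-in for an enlarged virtio_net_config; both fields are assumptions. */
struct model_net_config {
    uint16_t rx_buf_size;  /* assumed: buffer size the host NIC driver expects */
    uint16_t rx_headroom;  /* assumed: bytes reserved at the head of a buffer */
};

/* What the guest driver ends up using for its rx buffers. */
struct model_rx_params {
    int passthru;
    uint16_t buf_size;
    uint16_t headroom;
};

static int model_has_feature(uint64_t features, unsigned int bit)
{
    assert(bit < 64);
    return (int)((features >> bit) & 1u);
}

/*
 * Probe-time decision: read the extra config fields only when the host
 * offered the passthru feature; otherwise keep the guest defaults.
 */
struct model_rx_params model_probe(uint64_t host_features,
                                   const struct model_net_config *cfg,
                                   uint16_t default_size,
                                   uint16_t default_headroom)
{
    struct model_rx_params p = { 0, default_size, default_headroom };

    if (cfg && model_has_feature(host_features, MODEL_F_PASSTHRU_BIT)) {
        p.passthru = 1;
        p.buf_size = cfg->rx_buf_size;
        p.headroom = cfg->rx_headroom;
    }
    return p;
}
```

A guest that does not know the flag never looks at the extra fields, which is the usual way a feature bit keeps an enlarged config structure backward compatible.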
This means that we can't, for example, live-migrate between different systems

Looking at the code, this seems to have to do with alignment - could just be

Well, that's an important datapoint. By the way, you'll need header copy to activate LRO in the host, so that's a good

I do not see why, vhost currently supports 64K buffers with indirect

My guess would be yes. Mergeable buffers are a memory-saving optimization, not a performance optimization; I don't see that they can help. And I think you can't rely solely on jumbo frames in hardware; not everyone can enable them. Having said that, the number one priority is getting decent performance out of the driver, in whatever way you see fit. I was just --
Ok. What we have thought about now is to do something with skb_reserve(). --
I don't mean this, it's for buffer submission. I mean when a packet is received, in receive_buf(), the mergeable buffer code knows which received pages can be hooked into the skb frags; it's receive_mergeable() which does this. When a NIC driver supports packet split mode, each ring descriptor contains an skb and a page. When a packet is received, if the status is not EOP, the page of the next descriptor is hooked to the previous skb. We don't know how many frags belong to one skb. So when the guest submits buffers, it should submit multiple pages, and on receive, the guest should know which pages belong to one skb and hook them together. I think receive_mergeable() can do this, but I don't see how big->packets handles this. Am I missing something here? Thanks Xiaohui --
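The packet split behaviour described above (keep chaining pages until a descriptor reports EOP) can be sketched in userspace as follows. It is only a model of the grouping logic, not driver code; the descriptor layout and all model_ names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Modelled rx descriptor: one page-sized buffer plus an end-of-packet flag. */
struct model_rx_desc {
    size_t len;  /* bytes the hardware wrote into this buffer */
    int eop;     /* 1 if this descriptor ends a packet */
};

/* Modelled packet: the run of descriptors that make up one skb. */
struct model_pkt {
    size_t first;      /* index of the first descriptor */
    size_t nr_frags;   /* number of descriptors chained together */
    size_t total_len;  /* sum of the descriptor lengths */
};

/*
 * Walk n completed descriptors and group them into packets, closing a
 * packet at each EOP. A trailing run without EOP is left unreported.
 * Returns the number of complete packets written to out (at most max_out).
 */
size_t model_group_by_eop(const struct model_rx_desc *ring, size_t n,
                          struct model_pkt *out, size_t max_out)
{
    size_t i, npkts = 0, start = 0, len = 0;

    assert(ring != NULL && out != NULL);
    for (i = 0; i < n && npkts < max_out; i++) {
        len += ring[i].len;
        if (ring[i].eop) {
            out[npkts].first = start;
            out[npkts].nr_frags = i - start + 1;
            out[npkts].total_len = len;
            npkts++;
            start = i + 1;
            len = 0;
        }
    }
    return npkts;
}
```

The number of fragments per packet is only known once EOP is seen, which is why the receiver, rather than the buffer submitter, has to do the grouping.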
Yes, I think this packet split mode probably maps well to mergeable buffer support. Note that
1. Not all devices support large packets in this way, others might map to indirect buffers better
   So we have to figure out how migration is going to work
2. It's up to guest driver whether to enable features such as mergeable buffers and indirect buffers
   So we have to figure out how to notify guest which mode is optimal for a given device
3. We don't want to depend on jumbo frames for decent performance
   So we probably should support GSO/GRO
-- MST --
Yes, different guest virtio-net drivers may contain different features. Does qemu migration work with different features supported by virtio-net

Yes. When a device is bound, the mp device may query the capabilities from the driver. Actually, there is now a structure in the mp device that can do this; we can add some field

GSO is for the tx side, right? I think the driver can handle it itself. For GRO, I'm not sure whether it's easy or not. Basically, the mp device as we currently support it does what a raw socket does. The packets do not go to the host stack. -- MST --
For now, you must have identical feature-sets for migration to work. And as long as we manage the buffers in software, we can always make

See commit bfd5f4a3d605e0f6054df0b59fe0907ff7e696d3 (it doesn't currently work with vhost net, but that's --
