Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.

Previous thread: [PATCH] Fix missing of last user while dumping slab corruption log by ShiYong LI on Friday, April 2, 2010 - 12:21 am. (5 messages)

Next thread: [RFC] [PATCH v2 1/3] A device for zero-copy based on KVM virtio-net. by xiaohui.xin on Friday, April 2, 2010 - 12:27 am. (1 message)
From: xiaohui.xin
Date: Friday, April 2, 2010 - 12:25 am

The idea is simple: pin the guest VM user space memory and then
let the host NIC driver DMA directly into it.
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops (sendmsg/recvmsg) to vhost-net so that it can
send/recv directly to/from the NIC driver. A KVM guest that uses the
vhost-net backend may bind any ethX interface on the host side to
get copyless data transfer through the guest virtio-net frontend.

The scenario is like this:

The guest virtio-net driver submits multiple requests through the vhost-net
backend driver to the kernel. The requests are queued and then
completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked. This means NICs can allocate user buffers
from a page constructor. We add a hook in the netif_receive_skb() function
to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device allocates a new host skb, puts the
payload on skb_shinfo(skb)->frags, and copies the header to skb->data.
The request remains pending until the skb is transmitted by h/w.
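
As a rough illustration of the write path described above, here is a minimal
user-space sketch in C, not the patch's code. Everything prefixed with mock_
or MOCK_ is hypothetical: struct mock_skb only imitates the split between the
linear area (skb->data) and the frags array, and the 4096-byte page size, the
128-byte header room and the limit of 18 frags are assumptions made for the
example. The header is copied; the payload is only referenced page by page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_PAGE_SIZE 4096u   /* assumed page size */
#define MOCK_MAX_FRAGS 18      /* assumed frag limit for this sketch */
#define MOCK_HDR_ROOM  128     /* assumed size of the linear area */

/* One payload fragment: a page plus an offset and length inside it. */
struct mock_frag {
    uintptr_t page_addr;   /* page-aligned address of the pinned page */
    size_t offset;         /* start of the data within the page */
    size_t len;            /* bytes of payload in this page */
};

/* Hypothetical stand-in for an skb with a linear area and frags. */
struct mock_skb {
    unsigned char data[MOCK_HDR_ROOM];       /* header is copied here */
    size_t linear_len;
    struct mock_frag frags[MOCK_MAX_FRAGS];  /* payload is referenced here */
    size_t nr_frags;
    size_t total_len;
};

/*
 * Copy the header into the linear area and describe the payload as
 * per-page fragments without copying it.
 * Returns 0 on success, -1 if the header or payload does not fit.
 */
int mock_build_tx_skb(struct mock_skb *skb,
                      const void *hdr, size_t hdr_len,
                      uintptr_t payload_addr, size_t payload_len)
{
    if (hdr_len > MOCK_HDR_ROOM)
        return -1;

    memset(skb, 0, sizeof(*skb));
    memcpy(skb->data, hdr, hdr_len);
    skb->linear_len = hdr_len;
    skb->total_len = hdr_len + payload_len;

    while (payload_len > 0) {
        size_t off = payload_addr % MOCK_PAGE_SIZE;
        size_t chunk = MOCK_PAGE_SIZE - off;

        if (chunk > payload_len)
            chunk = payload_len;
        if (skb->nr_frags == MOCK_MAX_FRAGS)
            return -1;   /* payload spans too many pages */

        skb->frags[skb->nr_frags].page_addr = payload_addr - off;
        skb->frags[skb->nr_frags].offset = off;
        skb->frags[skb->nr_frags].len = chunk;
        skb->nr_frags++;

        payload_addr += chunk;
        payload_len -= chunk;
    }
    return 0;
}
```

In this sketch a 5000-byte payload starting 100 bytes into a page yields two
fragments (3996 and 1004 bytes), while only the header bytes are copied.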

Here, we have considered 2 ways to utilize the page constructor
API to dispense the user buffers.

One:	Modify the __alloc_skb() function a bit so that it allocates only
	the sk_buff structure, with the data pointer pointing to a
	user buffer which comes from a page constructor API.
	The shinfo of the skb then also comes from the guest.
	When a packet is received from hardware, skb->data is filled
	directly by h/w. This is the way we have implemented it.

	Pros:	We can avoid any copy here.
	Cons:	The guest virtio-net driver needs to allocate skbs in almost
		the same way as the host NIC drivers, i.e. the same size
		as netdev_alloc_skb() and the same reserved space at the
		head of the skb. Many NIC drivers match the guest here and
		are ok with this. But some of the latest NIC drivers reserve
		special room in the skb head. To deal with it, we suggest to ...
From: Sridhar Samudrala
Date: Friday, April 2, 2010 - 4:51 pm

What is the advantage of this approach compared to PCI-passthrough
of the host NIC to the guest?
Does this require pinning of the entire guest memory? Or only the
send/receive buffers?

Thanks

--

From: Avi Kivity
Date: Saturday, April 3, 2010 - 9:32 am

swapping/ksm/etc
independence from host hardware

If done correctly, just the send/receive buffers.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Xin, Xiaohui
Date: Monday, April 5, 2010 - 11:06 pm

PCI-passthrough needs hardware support: some kind of IOMMU engine is
needed to translate guest physical addresses to host physical addresses.
And currently, a PCI-passthrough device cannot be live-migrated.

The zero-copy is a pure software solution. It doesn't need special hardware support.

We need only to pin the send/receive buffers.
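
To make "only the send/receive buffers" concrete, the following is a small
user-space sketch in C of the arithmetic involved: how many pages one buffer
touches, and therefore how many pages would have to be pinned for it. The
function name mock_pages_spanned() and the 4096-byte page size are assumptions
for illustration only; in the kernel the pinning itself would typically go
through get_user_pages(), which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_PAGE_SIZE 4096u   /* assumed page size */

/*
 * Number of pages touched by the byte range [addr, addr + len).
 * Hypothetical helper: only this many pages need pinning for one
 * buffer, independent of the total size of guest memory.
 */
size_t mock_pages_spanned(uintptr_t addr, size_t len)
{
    uintptr_t first, last;

    if (len == 0)
        return 0;

    assert(addr <= UINTPTR_MAX - (len - 1));   /* range must not wrap */

    first = addr / MOCK_PAGE_SIZE;
    last = (addr + len - 1) / MOCK_PAGE_SIZE;
    return (size_t)(last - first) + 1;
}
```

A page-aligned 4096-byte buffer spans one page, while the same buffer shifted
by a single byte spans two, so buffer alignment affects the pinning cost.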

Thanks

--

From: Michael S. Tsirkin
Date: Wednesday, April 14, 2010 - 8:25 am

Unfortunately, this would break compatibility with existing virtio.

The obvious question would be whether you see any speed difference
with the two approaches. If no, then the second approach would be


--

From: Xin, Xiaohui
Date: Thursday, April 15, 2010 - 2:36 am

You mean any modification to the guest virtio-net driver will break
compatibility? We tried to enlarge virtio_net_config to contain the
2 parameters, and to add one VIRTIO_NET_F_PASSTHRU flag; virtionet_probe()
will check the feature flag and get the parameters, then virtio-net driver use
I'm not sure, but the latest ixgbe driver does this, it reserves 32 bytes compared to

I remember the second approach is a bit slower with a 1500 MTU.
We can support a 1500 MTU, but for jumbo frames, since the vhost driver
previously didn't support mergeable buffers, we cannot try it for multiple sg.
A jumbo frame will be split into 5 frags, which are hooked once a descriptor,
so the user buffer allocation is greatly dependent
Actually, I think if the mergeable buffer may get good performance, then GRO is not 
--

From: Michael S. Tsirkin
Date: Thursday, April 15, 2010 - 3:05 am

This means that we can't, for example, live-migrate between different systems

Looking at code, this seems to do with alignment - could just be

Well, that's an important datapoint. By the way, you'll need
header copy to activate LRO in host, so that's a good

I do not see why, vhost currently supports 64K buffers with indirect

My guess would be yes. Mergeable buffers is a memory saving
optimization, not a performance optimization, I don't see
that it can help. And I think you can't solely rely on jumbo frames
in hardware, not everyone can enable them.

Having said that, number one priority is getting decent performance
out of the driver, in whatever way you find fit. I was just
--

From: Xin, Xiaohui
Date: Monday, April 19, 2010 - 3:05 am

Ok. What we have thought about now is to do something with skb_reserve().


--

From: Michael S. Tsirkin
Date: Monday, April 19, 2010 - 3:21 am

From: Xin, Xiaohui
Date: Monday, April 19, 2010 - 7:21 pm

I don't mean this; that is for buffer submission. I mean that when a packet is received, in receive_buf(), the mergeable buffer code knows which received pages can be hooked into the skb frags; it is receive_mergeable() which does this.

When a NIC driver supports packet split mode, each ring descriptor contains an skb and a page. When a packet is received, if the status is not EOP, the page of the next descriptor is hooked to the previous skb. We don't know how many frags belong to one skb. So when the guest submits buffers, it should submit multiple pages, and on receive, the guest should know which pages belong to one skb and hook them together. I think receive_mergeable() can do this, but I don't see how big->packets handles this. Am I missing something here?
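
The EOP-based chaining described above can be sketched in user-space C as
follows. This is only an illustration under stated assumptions: the types
mock_rx_desc and mock_pkt and the function mock_gather_packet() are
hypothetical and do not correspond to any real driver structure. Each
descriptor carries a length and an EOP flag, and descriptors are gathered
into one packet until a descriptor with EOP set is seen.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rx descriptor: one page of data plus an end-of-packet flag. */
struct mock_rx_desc {
    unsigned int len;   /* bytes the h/w wrote into this descriptor's page */
    int eop;            /* non-zero if this descriptor ends the packet */
};

/* Hypothetical result: which descriptors make up one packet. */
struct mock_pkt {
    size_t first;       /* ring index of the first descriptor */
    size_t nr_frags;    /* number of descriptors (pages) gathered */
    unsigned int len;   /* total packet length */
};

/*
 * Gather descriptors starting at ring index 'start' until EOP is seen.
 * 'avail' is how many descriptors the h/w has completed so far.
 * Returns the number of descriptors consumed, or 0 if the packet is
 * not complete yet (no EOP within the available descriptors).
 */
size_t mock_gather_packet(const struct mock_rx_desc *ring, size_t ring_size,
                          size_t start, size_t avail, struct mock_pkt *pkt)
{
    size_t i;

    assert(ring_size > 0 && start < ring_size);

    pkt->first = start;
    pkt->nr_frags = 0;
    pkt->len = 0;

    for (i = 0; i < avail && i < ring_size; i++) {
        const struct mock_rx_desc *d = &ring[(start + i) % ring_size];

        pkt->nr_frags++;
        pkt->len += d->len;
        if (d->eop)
            return i + 1;   /* packet complete */
    }
    return 0;               /* EOP not seen yet */
}
```

The receiver cannot know the fragment count in advance; it only learns it
when EOP arrives, which is why the guest has to be told which pages belong
to one packet.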

Thanks
Xiaohui 

--

From: Michael S. Tsirkin
Date: Wednesday, April 21, 2010 - 1:35 am

Yes, I think this packet split mode probably maps well to mergeable buffer
support. Note that
1. Not all devices support large packets in this way, others might map
   to indirect buffers better
   So we have to figure out how migration is going to work
2. It's up to guest driver whether to enable features such as
   mergeable buffers and indirect buffers
   So we have to figure out how to notify guest which mode
   is optimal for a given device
3. We don't want to depend on jumbo frames for decent performance
   So we probably should support GSO/GRO

-- 
MST
--

From: Xin, Xiaohui
Date: Thursday, April 22, 2010 - 1:57 am

Yes, different guest virtio-net drivers may support different features.
Does the qemu migration work with different features supported by virtio-net
Yes. When a device is bound, the mp device may query the capabilities from the driver.
Actually, there is a structure now in the mp device that can do this, we can add some field
GSO is for the tx side, right? I think the driver can handle it itself.
For GRO, I'm not sure whether it's easy or not. Basically, the mp device we
support now is doing what a raw socket is doing. The packets are not going to the host stack.
--

From: Michael S. Tsirkin
Date: Thursday, April 22, 2010 - 2:19 am

For now, you must have identical feature-sets for migration to work.
And as long as we manage the buffers in software, we can always make

See commit bfd5f4a3d605e0f6054df0b59fe0907ff7e696d3
(it doesn't currently work with vhost net, but that's
--
