Device passthrough technology allows a guest to bypass the hypervisor and drive the underlying physical device. VMware has been exploring various ways to deliver this technology to users in a manner which is easy to adopt. In this process we have prepared an architecture along with Intel - NPA (Network Plugin Architecture). NPA allows the guest to use the virtualized NIC vmxnet3 to passthrough to a number of physical NICs which support it. The document below provides an overview of NPA. We intend to upgrade the upstreamed vmxnet3 driver to implement NPA so that Linux users can exploit the benefits provided by passthrough devices in a seamless manner while retaining the benefits of virtualization. The document below tries to answer most of the questions which we anticipated. Please let us know your comments and queries. Thank you. Signed-off-by: Pankaj Thakkar <pthakkar@vmware.com> Network Plugin Architecture --------------------------- VMware has been working on various device passthrough technologies for the past few years. Passthrough technology is interesting as it can result in better performance/cpu utilization for certain demanding applications. In our vSphere product we support direct assignment of PCI devices like networking adapters to a guest virtual machine. This allows the guest to drive the device using the device drivers installed inside the guest. This is similar to the way KVM allows for passthrough of PCI devices to the guests. The hypervisor is bypassed for all I/O and control operations and hence it can not provide any value add features such as live migration, suspend/resume, etc. Network Plugin Architecture (NPA) is an approach which VMware has developed in joint partnership with Intel which allows us to retain the best of passthrough technology and virtualization. NPA allows for passthrough of the fast data (I/O) path and lets the hypervisor deal with the slow control path using traditional emulation/paravirtualization techniques. 
Through this ...
On Tue, 4 May 2010 16:02:25 -0700 Code please. Also, it has to work for all architectures not just VMware and Intel. --
The purpose of this email is to introduce the architecture and the design principles. The overall project involves more than just changes to the vmxnet3 driver and hence we thought an overview email would be better. Once people agree to the design in general we intend to provide the code changes to the vmxnet3 driver. The architecture supports more than Intel NICs. We started the project with Intel but plan to support all major IHVs including Broadcom, Qlogic, Emulex and others through a certification program. The architecture works on VMware ESX server only as it requires significant support from the hypervisor. Also, the vmxnet3 driver works on the VMware platform only. AFAICT Xen has a different model for supporting SR-IOV devices and allowing live migration and the document briefly talks about it (paragraph 6). Thanks, -pankaj --
From: Pankaj Thakkar <pthakkar@vmware.com> Stephen's point is that code talks and bullshit walks. Talk about high level designs rarely gets any traction, and often goes nowhere. Give us an example implementation so there is something concrete for us to sink our teeth into. --
Sure. We have been working on NPA for a while and have the code internally up and running. Let me sync up internally on how and when we can provide the vmxnet3 driver code so that people can look at it. --
On Tue, 4 May 2010 17:18:57 -0700 As Dave said, we care more about what the implementation looks like than the high level goals of the design. I think we all agree that better management of virtualized devices is necessary; the problem is that there are so many of them (vmware, xen, HV, Xen), and vendors seem to lean on their own specific implementation of offloading, which makes a general solution more difficult. Please, please solve this cleanly. The little things like APIs and locking semantics and handling of dynamic versus static control can make a design that is good in principle fall apart when someone does a bad job of implementing them. Lastly, projects that have had multiple people involved for long periods of time in the dark often end up building a legacy mentality "but we convinced vendor XXX to include it in their Enterprise version 666" and require lots of "retraining" before the code becomes acceptable. -- --
How do the throughput, latency, and host CPU utilization for the normal data path compare with, say, NetQueue? How many cards actually support this NPA interface? What does it look like, i.e. where is the NPA specification? (AFAIK, we never got the UPT How do you handle hardware which has a more symmetric view of the SR-IOV world (SR-IOV is only a PCI specification, not a network driver specification)? Or hardware which has multiple functions per physical This can happen without NPA as well. The VF simply needs to request the change via the PF (in fact, hw does that right now). Also, we already have a host side management interface via the PF (see, for example, the RTM_SETLINK IFLA_VF_MAC interface). So we have a plugin per hardware VF implementation? And the hypervisor Yes, this is important, esp. instead of the requirement for hw to implement a specific interface (I suspect you know all about this issue And it will need to be GPL AFAICT from what you've said thus far. It does sound worrisome, although I suppose hw firmware isn't particularly Please make this shell API interface and the PF/VF requirements available. thanks, -chris --
NetQueue is really for scaling across multiple VMs. NPA allows similar scaling and also helps in improving the CPU efficiency for a single VM since the hypervisor is bypassed. Throughput-wise both emulation and passthrough (NPA) can obtain line rate on 10G, but passthrough saves up to 40% CPU depending on the workload. We did a demo at IDF 2009 where we compared 8 VMs running on NetQueue vs. 8 VMs running on NPA (using Niantic) and we obtained similar CPU efficiency NPA and UPT share a lot of code in the hypervisor. UPT was adopted only by very We have it working internally with Intel Niantic (10G) and Kawela (1G) SR-IOV NICs. We are also working with an upcoming Broadcom 10G card and plan to support other IHVs. This is unlike UPT, so we don't dictate the register sets or rings like we did in UPT. Rather we have guidelines, for example that the card should have an embedded switch for inter-VF switching or should support programming (rx I am not sure what you mean by a symmetric view of the SR-IOV world? NPA allows multi-queue VFs and currently requires an embedded switch. As far as the PF driver is concerned we require IHVs to support all existing and upcoming features like NetQueue, FCoE, etc. The PF driver is considered special and is used to drive the traffic for the emulated/paravirtualized VMs and is also used to program things on behalf of the VFs through the hypervisor. If the hardware has multiple physical functions they are treated as separate adapters (with their own set of VFs) and we require the embedded switch to maintain that The setup is a 2.667GHz Nehalem server running a SLES11 VM talking to a 2.33GHz Barcelona client box running RHEL 5.1. We had netperf streams with 16k msg size over 64k socket size running between the server VM and the client, both using Intel Niantic 10G cards. In both cases (NPA and regular) the VM was CPU saturated (used one full core).
TX: regular vmxnet3 = 3085.5 Mbps/GHz; NPA vmxnet3 = 4397.2 Mbps/GHz RX: regular vmxnet3 = 1379.6 Mbps/GHz; NPA vmxnet3 = ...
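To put the TX figures above side by side, here is a small arithmetic sketch. It assumes "Mbps/GHz" means throughput per GHz of CPU consumed, which the message does not spell out, and it uses only the two TX numbers because the RX figure for NPA is cut off in this archive. The function names are made up for illustration.

```c
#include <assert.h>

/* Relative gain in throughput-per-GHz of NPA over regular vmxnet3,
 * as a fraction (0.25 would mean 25% more Mbps for each GHz spent). */
double efficiency_gain(double regular_mbps_per_ghz, double npa_mbps_per_ghz)
{
    assert(regular_mbps_per_ghz > 0.0 && npa_mbps_per_ghz > 0.0);
    return npa_mbps_per_ghz / regular_mbps_per_ghz - 1.0;
}

/* Fraction of CPU that would be saved if both paths carried the same
 * throughput, under the assumption that CPU cost scales linearly. */
double cpu_saving_at_equal_throughput(double regular_mbps_per_ghz,
                                      double npa_mbps_per_ghz)
{
    assert(regular_mbps_per_ghz > 0.0 && npa_mbps_per_ghz > 0.0);
    return 1.0 - regular_mbps_per_ghz / npa_mbps_per_ghz;
}
```

Fed with the TX numbers (3085.5 and 4397.2), the first function gives roughly a 42.5% efficiency gain and the second roughly a 30% CPU saving at equal throughput. Those are two views of the same ratio, and either may be what a given "percent saved" claim refers to.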
We're not going to add any kind of loader for binary blobs into kernel space, sorry. Don't even bother wasting your time on this. --
The mechanism described in the document is loading a binary blob coded to an abstract API. That's something entirely different from having normal modules for the Virtual Functions, which we already have for various pieces of hardware anyway. --
Yes, with the exception that the only body of code that will be accepted by the shell should be GPL-licensed and thus open and available for examining. This is not different from having a standard kernel module that is loaded normally and plugs into a certain subsystem. The difference is that the binary resides not on guest filesystem -- Dmitry --
[PT] Today this is tied to vmxnet3 device and is intended to work on ESX hypervisor only (vmxnet3 works on VMware hypervisor only). All the loading support is inside the ESX hypervisor. I am going to post the interface between the shell and the plugin soon and you can see that there is not a whole lot of dependency or infrastructure requirements from the Linux kernel. Please keep in mind that we don't use Linux as a hypervisor but as a guest VM. --
We have the right number of module loaders in the kernel: one. If you add another one, you're doubling the amount of code that anyone Your approach assumes that the plugin is always available, which has If you have the limited driver for some hardware that does not have the real thing, we could still ship just that. I would however guess that most vendors are interested in not just running in vmware but also other hypervisors that still require the full driver, so that case would be rare, especially in the long run. Arnd --
Since plugin[s] are carried by the host they are indeed always available. -- Dmitry --
But what makes you think that you can build code that can be linked into arbitrary future kernel versions? The kernel does not define any calling conventions that are stable across multiple versions or configurations. For example, you'd have to provide different binaries for each combination of - 32/64 bit code - gcc -mregparm=? - lockdep - tracepoints - stackcheck - NOMMU - highmem - whatever new gets merged If you build the plugins only for specific versions of "enterprise" Linux kernels, the code becomes really hard to debug and maintain. If you wrap everything in your own version of the existing interfaces, your code gets bloated to the point of being unmaintainable. So I have to correct myself: this is very different from assuming the driver is available in the guest, it's actually much worse. Arnd --
The plugin image is not linked against the Linux kernel. It is in fact OS agnostic (e.g. the same plugin works for Linux and Windows VMs). The plugin is built against the shell API interface. It is loaded by the hypervisor into a set of pages provided by the shell. Guest OS specific tasks (like allocation of pages for the plugin to load into) are handled by the shell, and the shell is the part which will be upstreamed in the Linux kernel. Maintenance of the shell is the same as for any other driver currently existing in the Linux kernel. --
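For readers trying to picture the shell/plugin split described here, the following is a rough sketch only. It is not taken from npa_shell_api.h or npa_plugin_api.h; every struct and function name below is hypothetical. It shows the general shape of a plugin that reaches the guest OS only through a table of callbacks supplied by the shell.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Services the shell offers to a plugin (hypothetical names). */
struct shell_api {
    uint8_t *(*alloc_rx_buffer)(void *shell_ctx, size_t len);
    void (*indicate_rx)(void *shell_ctx, uint8_t *buf, size_t len);
};

/* A toy shell standing in for the guest driver: one buffer, two counters. */
struct toy_shell {
    uint8_t buf[64];
    size_t frames_indicated;
    size_t last_len;
};

uint8_t *toy_alloc_rx_buffer(void *shell_ctx, size_t len)
{
    struct toy_shell *s = shell_ctx;
    return len <= sizeof(s->buf) ? s->buf : NULL;
}

void toy_indicate_rx(void *shell_ctx, uint8_t *buf, size_t len)
{
    struct toy_shell *s = shell_ctx;
    (void)buf;
    s->frames_indicated++;
    s->last_len = len;
}

/* Plugin side: receives one frame and hands it up. It calls nothing from
 * the guest OS except what `api` provides. Returns 0 on success, -1 if the
 * shell could not supply a buffer. */
int plugin_rx_one(const struct shell_api *api, void *shell_ctx,
                  const uint8_t *hw_frame, size_t len)
{
    uint8_t *dst;

    assert(api && api->alloc_rx_buffer && api->indicate_rx);
    dst = api->alloc_rx_buffer(shell_ctx, len);
    if (!dst)
        return -1;
    memcpy(dst, hw_frame, len);
    api->indicate_rx(shell_ctx, dst, len);
    return 0;
}
```

In this sketch the OS-specific work (where buffers come from, what happens to a received frame) sits entirely behind the callback table, which is the property the message claims for the real shell.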
Overhead of interpreting bytecode plugin is written in. Or are you saying plugin is x86 assembly (32bit or 64bit btw?) and other arches will have to have in kernel x86 emulator to use the plugin (like some of them had for vgabios)? -- Gleb. --
The plugin is x86 or x64 machine code. You write the plugin in C and compile it using gcc/ld to get the object file; we map only the relevant sections into the OS space. NPA is a way of enabling passthrough of SR-IOV NICs with live migration support on the ESX hypervisor, which runs only on x86/x64 hardware. It only supports x86/x64 guest OSes. So we don't have to worry about other architectures. If the NPA approach needs to be extended and adopted by other hypervisors then we have to take care of that. Today we have two plugin images per VF (one for 32-bit, one for 64-bit). --
Which is simply not supportable for a cross-platform operating system like Linux. --
We only support in-kernel drivers, everything else is subject to changes in the kernel API and ABI. What you do is basically introducing another wrapper layer not allowing full access to the normal Linux API. People have tried this before and we're not willing to add it. Do a little And that's not something we care about at all. The Linux kernel has traditionally a very hostile position against cross platform drivers for Yes, of course it does. It's a normal driver at the point which it But we use Linux as the hypervisor, too. So if you want to target a major infrastructure you might better make it available for that case. --
On Wed, 5 May 2010 13:39:51 -0400 Let me put it bluntly. Any design that allows external code to run in the kernel is not going to be accepted. Out of tree kernel modules are enough of a pain already, why do you expect the developers to add another interface. --
Exactly. Until our friends at VMware get this basic fact it's useless to continue arguing. Pankaj and Dmitry: you're fine to waste your time on this, but it's not going to go anywhere until you address that fundamental problem. The first thing you need to fix in your architecture is to integrate the VF function code into the kernel tree, and we can work from there. Please post patches doing this if you want to resume the discussion. --
As discussed, following is the patch to give you an idea about implementation of NPA for vmxnet3 driver. Although the patch is big, I have verified it with checkpatch.pl. It gave 0 errors / warnings. Signed-off-by: Matthieu Bucchaineri <matthieu@vmware.com> Signed-off-by: Shreyas Bhatewara <sbhatewara@vmware.com> --- drivers/net/vmxnet3/Makefile | 2 drivers/net/vmxnet3/npa_defs.h | 83 + drivers/net/vmxnet3/npa_plugin_api.h | 473 ++++++++ drivers/net/vmxnet3/npa_shell_api.h | 234 ++++ drivers/net/vmxnet3/vmxnet3_defs.h | 2 drivers/net/vmxnet3/vmxnet3_drv.c | 1845 +++++++++++++++++++-------------- drivers/net/vmxnet3/vmxnet3_ethtool.c | 66 + drivers/net/vmxnet3/vmxnet3_int.h | 221 ++-- drivers/net/vmxnet3/vmxnet3_plugin.c | 1221 ++++++++++++++++++++++ 9 files changed, 3221 insertions(+), 926 deletions(-) create mode 100644 drivers/net/vmxnet3/npa_defs.h create mode 100644 drivers/net/vmxnet3/npa_plugin_api.h create mode 100644 drivers/net/vmxnet3/npa_shell_api.h create mode 100644 drivers/net/vmxnet3/vmxnet3_plugin.c diff --git a/drivers/net/vmxnet3/Makefile b/drivers/net/vmxnet3/Makefile index 880f509..af501d8 100644 --- a/drivers/net/vmxnet3/Makefile +++ b/drivers/net/vmxnet3/Makefile @@ -32,4 +32,4 @@ obj-$(CONFIG_VMXNET3) += vmxnet3.o -vmxnet3-objs := vmxnet3_drv.o vmxnet3_ethtool.o +vmxnet3-objs := vmxnet3_drv.o vmxnet3_ethtool.o vmxnet3_plugin.o diff --git a/drivers/net/vmxnet3/npa_defs.h b/drivers/net/vmxnet3/npa_defs.h new file mode 100644 index 0000000..74d28b8 --- /dev/null +++ b/drivers/net/vmxnet3/npa_defs.h @@ -0,0 +1,83 @@ +/* + * Network Plugin Architecture definitions. + * + * Copyright (C) 2008-2010, VMware, Inc. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the + * Free Software Foundation; version 2 of the License and no later version. + * + * ...
On Mon, 12 Jul 2010 20:06:28 -0700 I think the concept won't fly. But you should really at least try running checkpatch to make sure the style conforms. -- --
On Mon, 12 Jul 2010 20:06:28 -0700 I am surprised, the code seems to use lots of mixed case in places that don't really follow current kernel practice. --
Your patch is line-wrapped and can not be applied :( Care to fix your email client? Is there some reason that our in-kernel functions that do this type of logic are not working for you to require you to reimplement this? thanks, greg k-h --
Greg, Thanks for pointing out. I will fix both these issues and repost the patch. ->Shreyas --
The plugin is guest agnostic and hence we did not want to rely on any kernel provided functions. The plugin uses only the interface provided by the shell. The assumption is that since the plugin is really simple and straightforward (all the control/init complexity lies in the PF driver in the hypervisor) we should be able to get by for most things, and for things like memcpy/memset the plugin can write simple functions like this. -p ________________________________________ From: Greg KH [greg@kroah.com] Sent: Wednesday, July 14, 2010 2:49 AM To: Shreyas Bhatewara Cc: Christoph Hellwig; Stephen Hemminger; Pankaj Thakkar; pv-drivers@vmware.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; virtualization@lists.linux-foundation.org Subject: Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3 Is there some reason that our in-kernel functions that do this type of logic are not working for you to require you to reimplement this? thanks, greg k-h --
From: Pankaj Thakkar <pthakkar@vmware.com> While I disagree entirely with this kind of approach, even that doesn't justify what you're doing here. memcpy() and memset() are on a much more fundamental ground than "kernel provided functions". They had better be available no matter where you build this thing. And doing what you're doing is foolish on so many levels. One more duplication of code, one more place for unnecessary bugs to live, one more place that might need optimizations and thus require duplication of even more work people have done over the years. --
Not to mention calling a function "MoveMemory" when it doesn't do a
memmove is just cruel.
J
--
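The distinction behind this complaint can be shown in a few lines. The patch's own MoveMemory is not quoted in this part of the thread, so the loop below is not a copy of it; it is a generic forward byte copy, written here only to show why such a loop is not a substitute for memmove() when source and destination overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Forward byte-by-byte copy. Correct for disjoint buffers, and for overlap
 * where dst is below src, but it clobbers unread source bytes when dst
 * starts inside the source range above src. */
void naive_forward_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    assert(dst && src);
    for (i = 0; i < n; i++)
        d[i] = s[i];
}

/* memmove() is specified to behave as if the source were first copied to a
 * temporary buffer, so any overlap is handled. */
void overlap_safe_copy(void *dst, const void *src, size_t n)
{
    assert(dst && src);
    memmove(dst, src, n);
}
```

Shifting the first four bytes of "abcde" up by one position gives "aabcd" with overlap_safe_copy() but "aaaaa" with the forward loop, because each byte is overwritten before it is read.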
Really? vmxnet3_plugin.c is not supposed to use any kernel-provided functions at all? Then why have it in the kernel at all? Seriously, if it's so simple, then why does it need to be separate? Why not just put it in your driver as-is to handle the ring-buffer logic (as that's all it looks to be doing), and then you don't need any plugin code at all? It looks like you are linking this file into your "main" driver module, so I fail to see any type of separation at all happening with this patch. Or am I totally missing something here? thanks, greg k-h --
Reposting the patch with the fixes. --- From: Shreyas Bhatewara <sbhatewara@vmware.com> Patch to enable NPA support in vmxnet3 driver. Signed-off-by: Matthieu Bucchaineri <matthieu@vmware.com> Signed-off-by: Shreyas Bhatewara <sbhatewara@vmware.com> --- drivers/net/vmxnet3/Makefile | 2 drivers/net/vmxnet3/npa_defs.h | 83 + drivers/net/vmxnet3/npa_plugin_api.h | 473 ++++++++ drivers/net/vmxnet3/npa_shell_api.h | 234 ++++ drivers/net/vmxnet3/vmxnet3_defs.h | 2 drivers/net/vmxnet3/vmxnet3_drv.c | 1841 +++++++++++++++++++-------------- drivers/net/vmxnet3/vmxnet3_ethtool.c | 66 + drivers/net/vmxnet3/vmxnet3_int.h | 221 ++-- drivers/net/vmxnet3/vmxnet3_plugin.c | 1199 +++++++++++++++++++++ 9 files changed, 3195 insertions(+), 926 deletions(-) create mode 100644 drivers/net/vmxnet3/npa_defs.h create mode 100644 drivers/net/vmxnet3/npa_plugin_api.h create mode 100644 drivers/net/vmxnet3/npa_shell_api.h create mode 100644 drivers/net/vmxnet3/vmxnet3_plugin.c diff --git a/drivers/net/vmxnet3/Makefile b/drivers/net/vmxnet3/Makefile index 880f509..af501d8 100644 --- a/drivers/net/vmxnet3/Makefile +++ b/drivers/net/vmxnet3/Makefile @@ -32,4 +32,4 @@ obj-$(CONFIG_VMXNET3) += vmxnet3.o -vmxnet3-objs := vmxnet3_drv.o vmxnet3_ethtool.o +vmxnet3-objs := vmxnet3_drv.o vmxnet3_ethtool.o vmxnet3_plugin.o diff --git a/drivers/net/vmxnet3/npa_defs.h b/drivers/net/vmxnet3/npa_defs.h new file mode 100644 index 0000000..74d28b8 --- /dev/null +++ b/drivers/net/vmxnet3/npa_defs.h @@ -0,0 +1,83 @@ +/* + * Linux driver for VMware's vmxnet3 ethernet NIC. + * + * Copyright (C) 2008-2009, VMware, Inc. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the + * Free Software Foundation; version 2 of the License and no later version. + * + * This program is distributed in the hope that it will be ...
Why would the kernel care about this file path? And since when do we hard-code file paths in the kernel in the first place (yeah, in some This is happily copied around and zeroed out, but never actually used by This field is never used. This hiding of functions kind of implies that something odd is going on here, right? At the least, make them inline functions so you get the This will never work, sorry. Please use the proper functions for doing this type of access. I'm amazed that anyone even thought this would What's wrong with the kernel provided function for this? Anyway, just randomly poking at the code like this turns up these types of trivial issues, has this code ever been run? wierd, greg k-h --
Is this enforced? Since you pass the hardware through, you can't rely This is essentially a miniature network stack with a its own mini bonding layer, mini hotplug, and mini API, except s/API/ABI/. Is this a correct view? If so, the Linuxy approach would be to use the ordinary drivers and the Linux networking API, and hide the bond setup using namespaces. The bond driver, or perhaps a new, similar, driver can be enhanced to propagate ethtool commands to its (hidden) components, and to have a control channel with the hypervisor. This would make the approach hypervisor agnostic, you're just pairing So the Shell would be the reworked or new bond driver, and Plugins would be ordinary Linux network drivers. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
We don't pass the whole VF to the guest. Only the BAR which is responsible for
TX/RX/intr is mapped into guest space. The interface between the shell and
plugin only allows operations related to TX and RX, such as sending a packet
to the VF, allocating RX buffers, and indicating a packet up to the shell. All
control operations are handled by the shell and the shell does what the existing
vmxnet3 driver does (touch a specific register and let the device emulation do
the work). When a VF is mapped to the guest the hypervisor knows this and
programs the h/w accordingly on behalf of the shell. So for example if the VM
does a MAC address change inside the guest, the shell would write to
VMXNET3_REG_MAC{L|H} registers which would trigger the device emulation to read
the new mac address and update its internal virtual port information for the
virtual switch and if the VF is mapped it would also program the embedded
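As a rough illustration of the control-path step described above, the sketch below packs a 6-byte MAC address into a 32-bit low word and a 16-bit high word and stores them in two registers. The register file is a plain array standing in for the emulated device, the enum names are placeholders for VMXNET3_REG_MACL and VMXNET3_REG_MACH, and the byte layout is an assumption made for the example rather than something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder indices standing in for VMXNET3_REG_MACL / VMXNET3_REG_MACH. */
enum { REG_MACL = 0, REG_MACH = 1, REG_COUNT = 2 };

/* Stand-in for the emulated device's register file. */
struct fake_dev {
    uint32_t regs[REG_COUNT];
};

/* Bytes 0..3 of the MAC, byte 0 in the least significant position
 * (assumed layout). */
uint32_t mac_low32(const uint8_t mac[6])
{
    return (uint32_t)mac[0] | ((uint32_t)mac[1] << 8) |
           ((uint32_t)mac[2] << 16) | ((uint32_t)mac[3] << 24);
}

/* Bytes 4..5 of the MAC in the low 16 bits (assumed layout). */
uint32_t mac_high16(const uint8_t mac[6])
{
    return (uint32_t)mac[4] | ((uint32_t)mac[5] << 8);
}

/* What the shell does on a MAC change in this sketch: two register writes.
 * In the real design the device emulation would react to these writes. */
void shell_set_mac(struct fake_dev *dev, const uint8_t mac[6])
{
    assert(dev && mac);
    dev->regs[REG_MACL] = mac_low32(mac);
    dev->regs[REG_MACH] = mac_high16(mac);
}
```

The point of the sketch is that the guest-side code path is the same whether or not a VF is mapped. Anything further, such as programming the NIC on the VF's behalf, happens on the hypervisor side and is out of the guest's view.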
To some extent yes, but there is no complicated bonding, nor is there anything
like a PCI hotplug. The shell interface is small and the OS always interacts
with the shell as the main driver. Based on the underlying VF the plugin
changes and this plugin as well is really small. Our vmxnet3 s/w plugin is
about 1300 lines with whitespace and comments and the Intel Kawela plugin is
about 1100 lines with whitespace and comments. The design principle is to put
more of the complexity related to initialization/control into the PF driver
In NPA we do not rely on the guest OS to provide any of these services like
bonding or PCI hotplug. We don't rely on the guest OS to unmap a VF and switch
a VM out of passthrough. In a bonding approach that becomes an issue: you can't
just yank a device from underneath, you have to wait for the OS to process the
request and switch from using VF to the emulated device and this makes the
hypervisor dependent on the guest OS. Also we don't rely on the presence of all
the drivers inside the guest OS (be it Linux or Windows), the ESX hypervisor
carries all the ...
--
Well the Shell does some sort of bonding (there are two links and the shell selects which one to exercise) and some sort of hotplug. Since the Shell is part of the guest OS, you do rely on it. How can you unmap the VF without guest cooperation? If you're executing Plugin code, you can't yank anything out. What ISAs do those plugins support? -- error compiling committee.c: too many arguments to function --
No. This is a guideline which we provided to IHVs and would have to be enforced In our Kawela plugin we don't have any reads from the memory space at all. Hence you can yank the VF anytime (the code loaded in the guest address space will keep on executing). Even if there were reads we can map the memory pages to a NULL page and return 0xffffffff so that the plugin can detect this and return an error to the shell. Remember there are no control operations in the plugin and the code is really small (about 1k lines compared to 5k lines in Depends on the model. Today the plugin code for checking the TX/RX rings runs x86 and x64. Thanks, -pankaj --
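The all-ones convention mentioned above can be sketched as follows. This is not code from any NPA plugin; the function name and error codes are invented for the example, and only the 0xffffffff value comes from the message.

```c
#include <assert.h>
#include <stdint.h>

/* Value a read is said to return once the VF pages have been remapped. */
#define VF_GONE_PATTERN 0xffffffffu

/* Hypothetical status codes for this sketch. */
enum { PLUGIN_OK = 0, PLUGIN_ERR_VF_UNMAPPED = -1 };

/* Read a 32-bit device register and report an error to the caller (the
 * shell, in NPA terms) if the value looks like an unmapped VF. *out is
 * left untouched on error. */
int plugin_read_reg(const volatile uint32_t *reg, uint32_t *out)
{
    uint32_t v;

    assert(reg && out);
    v = *reg;
    if (v == VF_GONE_PATTERN)
        return PLUGIN_ERR_VF_UNMAPPED;
    *out = v;
    return PLUGIN_OK;
}
```

One caveat with this scheme in general: it only works for registers that can never legitimately read as all ones, so a plugin relying on it would have to choose which registers it checks.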
