Re: [RFC PATCH 00/17] virtual-bus

From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:42 am

applies to v2.6.29 (will port to git HEAD soon)

FIRST OFF: Let me state that this is not a KVM or networking specific
technology.  Virtual-Bus is a mechanism for defining and deploying
software “devices” directly in a Linux kernel.  The example use-case we
have provided supports a “virtual-ethernet” device being utilized in a
KVM guest environment, so comparisons to virtio-net will be natural.
However, please note that this is but one use-case, of many we have
planned for the future (such as userspace bypass and RT guest support).
The goal for right now is to describe what a virtual-bus is and why we
believe it is useful.

We intend to get this core technology merged, even if the networking
components are not accepted as is.  It should be noted that, in many ways,
virtio could be considered complementary to the technology.  We could,
in fact, have implemented the virtual-ethernet using a virtio-ring, but
it would have required ABI changes that we didn't yet want to propose
without having the concept in general vetted and accepted by the community.

To cut to the chase, we recently measured our virtual-ethernet on 
v2.6.29 on two 8-core x86_64 boxes with Chelsio T3 10GE connected back
to back via cross over.  We measured bare-metal performance, as well
as a kvm guest (running the same kernel) connected to the T3 via
a linux-bridge+tap configuration with a 1500 MTU.  The results are as
follows:

Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt)
Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt)
Venet: tput = 4050Mb/s, round-trip = 15255pps (65us rtt)

As you can see, all three technologies can achieve (MTU limited) line-rate,
but the virtio-net solution is severely limited on the latency front (by a
factor of 48:1 relative to venet).
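For reference, the rtt figures above follow directly from the round-trip
rates (rtt = 1/pps), and the 48:1 factor is the ratio of the virtio-net rtt
to the venet rtt.  A minimal sketch of that arithmetic in C; the helper name
rtt_us() is made up for illustration and is not part of the series:

```c
#include <assert.h>

/*
 * Hypothetical helper: convert a round-trip rate (transactions per
 * second) into a round-trip time in whole microseconds.  The division
 * truncates, which matches the rounding used in the table.
 */
unsigned long rtt_us(unsigned long pps)
{
	assert(pps > 0);
	return 1000000UL / pps;
}
```

Feeding in 25593, 320 and 15255 pps gives 39us, 3125us and 65us, and
3125 / 65 truncates to 48.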

Note that the 320pps is technically artificially low in virtio-net, caused by
a known design limitation to use a timer for tx-mitigation.  However, note that
even when removing the timer from the path the best we could achieve ...
From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:42 am

This interface provides a bidirectional shared-memory based signaling
mechanism.  It can be used by any entities which desire efficient
communication via shared memory.  The implementation details of the
signaling are abstracted so that they may transcend a wide variety
of locale boundaries (e.g. userspace/kernel, guest/host, etc).

The shm_signal mechanism supports event masking as well as spurious
event delivery mitigation.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/shm_signal.h |  188 ++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig                |   10 ++
 lib/Makefile               |    1 
 lib/shm_signal.c           |  186 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 385 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/shm_signal.h
 create mode 100644 lib/shm_signal.c

diff --git a/include/linux/shm_signal.h b/include/linux/shm_signal.h
new file mode 100644
index 0000000..a65e54e
--- /dev/null
+++ b/include/linux/shm_signal.h
@@ -0,0 +1,188 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_SHM_SIGNAL_H
+#define _LINUX_SHM_SIGNAL_H
+
+#include <asm/types.h>
+
+/*
+ *---------
+ * The following structures represent data that is shared ...
From: Avi Kivity
Date: Tuesday, March 31, 2009 - 1:44 pm

Similarly, this should be padded to 0 (mod 8).

Instead of versions, I prefer feature flags which can be independently 

This means "->inject() has been called from the other side"?

(reading below I see this is so.  not used to reading well commented 


When you overlay a ring on top of this, won't the ring indexes convey 
the same information as ->dirty?


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 1:58 pm

Yeah, good idea.  What is the official way to do this these days?  Are

Totally agreed.  If you look, most of the ABI has a type of "NEGCAP"
(negotiate capabilities) feature.  The version number is a contingency
plan in case I still have to break it for whatever reason.   I will
always opt for the feature bits over bumping the version when its


This is still somewhat of an immature part of the design.  It's supposed
to be used so that by default, it's a panic.  But on the host side, we
can do something like inject a machine-check.  That way malicious/broken
guests cannot (should not? ;) be able to take down the host.  Note today
I do not map this to anything other than the default panic, so this
needs some love.

But given the asynchronous nature of the fault, I want to be sure we
have decent accounting to avoid bug reports like "silent MCE kills the
guest" ;)  At least this way, we can log the fault string somewhere to

I agree that the information may be redundant with components of the
broader shm state.  However, we need this state at this level of scope
in order to function optimally, so I don't think it's a huge deal to have
this here as well.  After all, the shm_signal library can only assess its
internal state.  We would have to teach it how to glean the broader
state through some mechanism otherwise (callback, perhaps), but I don't
think it's worth it.


From: Avi Kivity
Date: Tuesday, March 31, 2009 - 2:05 pm

I see.

This raises a point I've been thinking of - the symmetrical nature of 
the API vs the asymmetrical nature of guest/host or user/kernel 
interfaces.  This is most pronounced in ->inject(); in the host->guest 
direction this is async (host can continue processing while the guest is 
handling the interrupt), whereas in the guest->host direction it is 
synchronous (the guest is blocked while the host is processing the call, 
unless the host explicitly hands off work to a different thread).



--

From: Gregory Haskins
Date: Wednesday, April 1, 2009 - 5:12 am

Oh, duh.  Dumb question.  I was getting confused with "pack", not pad.  :)


Note that this is exactly what I do (though it is device specific).
venet-tap has an ioq_notifier registered on its "rx" ring (which is the
tx-ring for the guest) that simply calls ioq_notify_disable() (which
calls shm_signal_disable() under the covers) and it wakes its
rx-thread.  This all happens in the context of the hypercall, which then


From: Avi Kivity
Date: Wednesday, April 1, 2009 - 5:24 am

I think this is suboptimal.  The ring is likely to be cache hot on the 
current cpu; waking a thread will introduce scheduling latency + IPI 
+ cache-to-cache transfers.

On a benchmark setup, host resources are likely to exceed guest 
requirements, so you can throw cpu at the problem and no one notices.  
But I think the bits/cycle figure will decrease, even if bits/sec increases.


--

From: Gregory Haskins
Date: Wednesday, April 1, 2009 - 6:57 am

Heh, yes I know this is your (well documented) position, but I
respectfully disagree. :)

CPUs are not getting much faster, but they are rapidly getting more
cores.  If we want to continue to make software run increasingly faster,
we need to actually use those cores IMO.  Generally this means splitting
workloads up into as many threads as possible as long as you can keep

This part is a valid criticism, though note that Linux is very adept at
scheduling so we are talking mere ns/us range here, which is dwarfed by
the latency of something like your typical IO device (e.g. 36us for a
rtt packet on 10GE baremetal, etc).  The benefit, of course, is the
potential for increased parallelism which I have plenty of data to show
we are very much taking advantage of here (I can saturate two cores
almost completely according to LTT traces, one doing vcpu work, and the

This one I take exception to.  While it is perfectly true that splitting
the work between two cores has a greater cache impact than staying on
one, you cannot look at this one metric alone and say "this is bad".
It's also a function of how efficiently the second (or more) cores are
utilized.  There will be a point in the curve where the cost of cache
coherence will become marginalized by the efficiency added by the extra
compute power.  Some workloads will invariably be on the bad end of that
curve, and therefore doing the work on one core is better.  However, we
can't ignore that there will be others that are on the good end of this
spectrum either.  Otherwise, we risk performance stagnation on our
effectively uniprocessor box ;).  In addition, the task-scheduler will
attempt to co-locate tasks that are sharing data according to a best-fit
within the cache hierarchy.  Therefore, we will still be sharing as much
as possible (perhaps only L2, L3, or a local NUMA domain, but this is
still better than nothing).

The way I have been thinking about these issues is something I have been
calling "soft-asics".  In the early days, we had ...
From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:43 am

We expect to have various types of connection-clients (e.g. userspace,
kvm, etc), each of which is likely to have common access patterns and
marshalling duties.  Therefore we create a "client" API to simplify
client development by helping with mundane tasks such as handle-2-pointer
translation, etc.
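As a rough picture of the handle-to-pointer translation mentioned above,
here is a small userspace sketch in C.  It assumes a fixed-size table and
made-up names (handle_table, handle_alloc, handle_lookup, handle_free); the
actual vbus_client code may be organized quite differently.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLES 16 /* arbitrary table size for this sketch */

/* Hypothetical per-client table mapping small integers to objects. */
struct handle_table {
	void *slots[MAX_HANDLES];
};

/* Register a kernel-side object; returns a handle >= 0, or -1 if full. */
int handle_alloc(struct handle_table *t, void *ptr)
{
	assert(ptr != NULL);
	for (int i = 0; i < MAX_HANDLES; i++) {
		if (t->slots[i] == NULL) {
			t->slots[i] = ptr;
			return i;
		}
	}
	return -1;
}

/*
 * Translate a handle supplied by an untrusted client back into a
 * pointer.  Out-of-range or unused handles yield NULL rather than a
 * wild pointer.
 */
void *handle_lookup(const struct handle_table *t, long handle)
{
	if (handle < 0 || handle >= MAX_HANDLES)
		return NULL;
	return t->slots[handle];
}

/* Release a handle so the slot can be reused. */
void handle_free(struct handle_table *t, long handle)
{
	if (handle >= 0 && handle < MAX_HANDLES)
		t->slots[handle] = NULL;
}
```

The point of routing every client request through a lookup like this is that
the client never sees or supplies raw kernel pointers, only small integers
that are validated on each use.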

Special thanks to Pat Mullaney for suggesting the optimization to pass
a cookie object down during DEVICESHM operations to save lookup overhead
on the event channel.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/vbus_client.h |  115 +++++++++
 kernel/vbus/Makefile        |    2 
 kernel/vbus/client.c        |  527 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 643 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/vbus_client.h
 create mode 100644 kernel/vbus/client.c

diff --git a/include/linux/vbus_client.h b/include/linux/vbus_client.h
new file mode 100644
index 0000000..62dab78
--- /dev/null
+++ b/include/linux/vbus_client.h
@@ -0,0 +1,115 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Virtual-Bus - Client interface
+ *
+ * We expect to have various types of connection-clients (e.g. userspace,
+ * kvm, etc).  Each client will be connecting from some environment outside
+ * of the kernel, and therefore will not have direct access to the API as
+ * presented in ./linux/vbus.h.  There will undoubtedly be some parameter
+ * marshalling that must occur, as well as common patterns for the handling
+ * of those marshalled parameters (e.g. translating a handle into a pointer,
+ * etc).
+ *
+ * Therefore this "client" API is provided to simplify the development
+ * of any clients.  Of course, a client is free to bypass this API entirely
+ * and communicate with the direct VBUS API if desired.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * ...
From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:43 am

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/venet.h |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 47 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/venet.h

diff --git a/include/linux/venet.h b/include/linux/venet.h
new file mode 100644
index 0000000..ef6b199
--- /dev/null
+++ b/include/linux/venet.h
@@ -0,0 +1,47 @@
+/*
+ * Copyright 2008 Novell.  All Rights Reserved.
+ *
+ * Virtual-Ethernet adapter
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VENET_H
+#define _LINUX_VENET_H
+
+#define VENET_VERSION 1
+
+#define VENET_TYPE "virtual-ethernet"
+
+#define VENET_QUEUE_RX 0
+#define VENET_QUEUE_TX 1
+
+struct venet_capabilities {
+	__u32 gid;
+	__u32 bits;
+};
+
+/* CAPABILITIES-GROUP 0 */
+/* #define VENET_CAP_FOO    0   (No capabilities defined yet, for now) */
+
+#define VENET_FUNC_LINKUP   0
+#define VENET_FUNC_LINKDOWN 1
+#define VENET_FUNC_MACQUERY 2
+#define VENET_FUNC_NEGCAP   3 /* negotiate capabilities */
+#define VENET_FUNC_FLUSHRX  4
+
+#endif /* _LINUX_VENET_H */

--

From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:43 am

We can map these over VBUS shared memory (or really any shared-memory
architecture if it supports shm-signals) to allow asynchronous
communication between two end-points.  Memory is synchronized using
pure barriers (i.e. lockless), so IOQs are friendly in many contexts,
even if the memory is remote.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/ioq.h |  410 +++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig         |   12 +
 lib/Makefile        |    1 
 lib/ioq.c           |  298 +++++++++++++++++++++++++++++++++++++
 4 files changed, 721 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ioq.h
 create mode 100644 lib/ioq.c

diff --git a/include/linux/ioq.h b/include/linux/ioq.h
new file mode 100644
index 0000000..d450d9a
--- /dev/null
+++ b/include/linux/ioq.h
@@ -0,0 +1,410 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * IOQ is a generic shared-memory, lockless queue mechanism. It can be used
+ * in a variety of ways, though its intended purpose is to become the
+ * asynchronous communication path for virtual-bus drivers.
+ *
+ * The following are a list of key design points:
+ *
+ * #) All shared-memory is always allocated on explicitly one side of the
+ *    link.  This typically would be the guest side in a VM/VMM scenario.
+ * #) Each IOQ has the concept of "north" and "south" locales, where
+ *    north denotes the memory-owner side (e.g. guest).
+ * #) An IOQ is manipulated using an iterator idiom.
+ * #) Provides a bi-directional signaling/notification infrastructure on
+ *    a per-queue basis, which includes an event mitigation strategy
+ *    to reduce boundary switching.
+ * #) The signaling path is abstracted so that various technologies and
+ *    topologies can define their own specific implementation while sharing
+ *    the basic structures and code.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it ...
From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:42 am

See Documentation/vbus.txt for details

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 Documentation/vbus.txt      |  386 +++++++++++++++++++++++++++++
 arch/x86/Kconfig            |    2 
 fs/proc/base.c              |   96 +++++++
 include/linux/sched.h       |    4 
 include/linux/vbus.h        |  147 +++++++++++
 include/linux/vbus_device.h |  416 ++++++++++++++++++++++++++++++++
 kernel/Makefile             |    1 
 kernel/exit.c               |    2 
 kernel/fork.c               |    2 
 kernel/vbus/Kconfig         |   14 +
 kernel/vbus/Makefile        |    1 
 kernel/vbus/attribute.c     |   52 ++++
 kernel/vbus/config.c        |  275 +++++++++++++++++++++
 kernel/vbus/core.c          |  567 +++++++++++++++++++++++++++++++++++++++++++
 kernel/vbus/devclass.c      |  124 +++++++++
 kernel/vbus/map.c           |   72 +++++
 kernel/vbus/map.h           |   41 +++
 kernel/vbus/vbus.h          |  116 +++++++++
 18 files changed, 2318 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/vbus.txt
 create mode 100644 include/linux/vbus.h
 create mode 100644 include/linux/vbus_device.h
 create mode 100644 kernel/vbus/Kconfig
 create mode 100644 kernel/vbus/Makefile
 create mode 100644 kernel/vbus/attribute.c
 create mode 100644 kernel/vbus/config.c
 create mode 100644 kernel/vbus/core.c
 create mode 100644 kernel/vbus/devclass.c
 create mode 100644 kernel/vbus/map.c
 create mode 100644 kernel/vbus/map.h
 create mode 100644 kernel/vbus/vbus.h

diff --git a/Documentation/vbus.txt b/Documentation/vbus.txt
new file mode 100644
index 0000000..e8a05da
--- /dev/null
+++ b/Documentation/vbus.txt
@@ -0,0 +1,386 @@
+
+Virtual-Bus:
+======================
+Author: Gregory Haskins <ghaskins@novell.com>
+
+
+
+
+What is it?
+--------------------
+
+Virtual-Bus is a kernel based IO resource container technology.  It is modeled
+on a concept similar to the Linux Device-Model (LDM), where we have buses,
+devices, and drivers as the primary actors.  ...
From: Ben Hutchings
Date: Thursday, April 2, 2009 - 9:06 am

On Tue, 2009-03-31 at 14:42 -0400, Gregory Haskins wrote:

This is kind of patronising; why don't you simply lay out how things

How about exposing a subdir for each device class under
/config/vbus/devices/ and allowing device creation only within those?
Two-stage construction is a pain for both users and implementors.

[...]

It seems to me that your "device-classes" correspond to drivers and
"interfaces" correspond to device classes in the LDM.  To avoid
confusion, I think the vbus terminology should be made consistent with
LDM.  And certainly these should not both be called simply "type" in the
configfs/sysfs interface.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--

From: Gregory Haskins
Date: Thursday, April 2, 2009 - 11:13 am

Hi Ben


Ya, point taken.  I think that was written really to myself, because my
first design *had* the device as a subordinate object.  Then I realized
later that I didn't like that design :)

I am not sure I follow.  It sounds like you are suggesting exactly what

I think I worded this awkwardly.  A device-class creates a
device-instance.  A device-instance registers one or more interfaces.
There are device types (of which I would classify both the device-class
and its instantiated device object as the same "type"), and there are
interface types.  The interface types may overlap across different
device types, as demonstrated below.  I will update the doc to be more

I don't think that is quite right, but I might be missing your point.
All of these objects exist on the "backend", of which there isn't a
specific precedent with LDM to express.  Normally in LDM, you would have
some kind of physical device object in the hardware (say a SATA disk),
and an LDM "block device" that represents it in software.  So we call
the LDM model for that disk a "device" but really it's like a proxy or a
software representative of the actual device itself.  And I am not
knocking this designation, as I think it makes a lot of sense.

However, what I will point out is that what we are creating here in vbus
is more akin to the SATA disk itself, not the LDM "block device"
representation of the device.   There was no really great existing way
to express this type of object, which is why I had to create a new
namespace in sysfs.

To dig down into this a little further, the device and interface are
inextricably linked in a relationship very close to this "physical
device" concept.  Therefore the "driver" portion of LDM that you
referenced w.r.t. the device-class doesn't even enter the picture here
(that would actually be up in the guest or userspace.
Discussed below).

As an example, consider an e1000 network card.  The PCI-ID and REV for
the e1000 card and the associated ABI are like ...
From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:43 am

This will generally be used for hypervisors to publish any host-side
virtual devices up to a guest.  The guest will have the opportunity
to consume any devices present on the vbus-proxy as if they were
platform devices, similar to existing buses like PCI.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/vbus_driver.h |   73 +++++++++++++++++++++
 kernel/vbus/Kconfig         |    9 +++
 kernel/vbus/Makefile        |    4 +
 kernel/vbus/proxy.c         |  152 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 238 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/vbus_driver.h
 create mode 100644 kernel/vbus/proxy.c

diff --git a/include/linux/vbus_driver.h b/include/linux/vbus_driver.h
new file mode 100644
index 0000000..c53e13f
--- /dev/null
+++ b/include/linux/vbus_driver.h
@@ -0,0 +1,73 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Mediates access to a host VBUS from a guest kernel by providing a
+ * global view of all VBUS devices
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_DRIVER_H
+#define _LINUX_VBUS_DRIVER_H
+
+#include <linux/device.h>
+#include <linux/shm_signal.h>
+
+struct vbus_device_proxy;
+struct vbus_driver;
+
+struct vbus_device_proxy_ops {
+	int (*open)(struct ...
From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:44 am

This adds a driver to interface between the host VBUS support, and the
guest-vbus bus model.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/Kconfig            |    9 +
 drivers/Makefile            |    1 
 drivers/vbus/proxy/Makefile |    2 
 drivers/vbus/proxy/kvm.c    |  726 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 738 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vbus/proxy/Makefile
 create mode 100644 drivers/vbus/proxy/kvm.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 91fefd5..8661495 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -451,6 +451,15 @@ config KVM_GUEST_DYNIRQ
        depends on KVM_GUEST
        default y
 
+config KVM_GUEST_VBUS
+       tristate "KVM virtual-bus (VBUS) guest-side support"
+       depends on KVM_GUEST
+       select VBUS_DRIVERS
+       default y
+       ---help---
+          This option enables guest-side support for accessing virtual-bus
+	  devices.
+
 source "arch/x86/lguest/Kconfig"
 
 config PARAVIRT
diff --git a/drivers/Makefile b/drivers/Makefile
index 98fab51..4f2cb93 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -107,3 +107,4 @@ obj-$(CONFIG_VIRTIO)		+= virtio/
 obj-$(CONFIG_STAGING)		+= staging/
 obj-y				+= platform/
 obj-$(CONFIG_VBUS_DEVICES)	+= vbus/devices/
+obj-$(CONFIG_VBUS_DRIVERS)	+= vbus/proxy/
diff --git a/drivers/vbus/proxy/Makefile b/drivers/vbus/proxy/Makefile
new file mode 100644
index 0000000..c18d58d
--- /dev/null
+++ b/drivers/vbus/proxy/Makefile
@@ -0,0 +1,2 @@
+kvm-guest-vbus-objs += kvm.o
+obj-$(CONFIG_KVM_GUEST_VBUS) += kvm-guest-vbus.o
diff --git a/drivers/vbus/proxy/kvm.c b/drivers/vbus/proxy/kvm.c
new file mode 100644
index 0000000..82e28b4
--- /dev/null
+++ b/drivers/vbus/proxy/kvm.c
@@ -0,0 +1,726 @@
+/*
+ * Copyright (C) 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *	Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * ...
From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:43 am

The ioapic code currently privately manages the mapping between irq
and vector.  This results in some layering violations as the support
for certain MSI operations needs this info.  As a result, the MSI
code itself was moved to the ioapic module.  This is not really
optimal.

We now have another need to gain access to the vector assignment on
x86.  However, rather than put yet another inappropriately placed
function into io-apic, let's create a way to export this simple data
and therefore allow the logic to sit closer to where it belongs.

Ideally we should abstract the entire notion of irq->vector management
out of io-apic, but we leave that as an exercise for another day.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/include/asm/irq.h |    6 ++++++
 arch/x86/kernel/io_apic.c  |   25 +++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 592688e..b1726d8 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -40,6 +40,12 @@ extern unsigned int do_IRQ(struct pt_regs *regs);
 extern void init_IRQ(void);
 extern void native_init_IRQ(void);
 
+#ifdef CONFIG_SMP
+extern int set_irq_affinity(int irq, cpumask_t mask);
+#endif
+
+extern int irq_to_vector(int irq);
+
 /* Interrupt vector management */
 extern DECLARE_BITMAP(used_vectors, NR_VECTORS);
 extern int vector_used_by_percpu_irq(unsigned int vector);
diff --git a/arch/x86/kernel/io_apic.c b/arch/x86/kernel/io_apic.c
index bc7ac4d..86a2c36 100644
--- a/arch/x86/kernel/io_apic.c
+++ b/arch/x86/kernel/io_apic.c
@@ -614,6 +614,14 @@ set_ioapic_affinity_irq(unsigned int irq, const struct cpumask *mask)
 
 	set_ioapic_affinity_irq_desc(desc, mask);
 }
+
+int set_irq_affinity(int irq, cpumask_t mask)
+{
+	set_ioapic_affinity_irq(irq, &mask);
+
+	return 0;
+}
+
 #endif /* CONFIG_SMP */
 
 /*
@@ -3249,6 +3257,23 @@ void destroy_irq(unsigned int irq)
 ...
From: Alan Cox
Date: Tuesday, March 31, 2009 - 12:16 pm

On Tue, 31 Mar 2009 14:43:55 -0400

This appears to have been muddled in with the vnet patches ?
--

From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 1:02 pm

It's needed for the kvm-connector patches later in the series, so it was
included intentionally.

On that topic, I probably should have had a TOC of some kind.  Hmm..let
me hack one together now:

Patch 1: Stand-alone "shared-memory signal" construct, used by various
components in vbus/venet
Patches 2-5: Basic vbus infrastructure
Patches 6-7: IOQ construct, similar to virtio-ring.  Used to overlay
ring-like behavior over the shm interface in vbus
Patches 8-12: virtual-ethernet front and backends
Patch 13: io-apic work to expose the irq-vector in x86, needed for
dynirq support
Patches 14-16: KVM host side support
Patch 17: KVM guest side support

Sorry for the confusion :(

Regards,
-Greg

From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:43 am

This module is similar in concept to a "tuntap".  A tuntap module provides
a netif() interface on one side, and a char-dev interface on the other.
Packets that ingress on one interface, egress on the other (and vice versa).

This module offers a similar concept, except that it substitutes the
char-dev for a VBUS/IOQ interface.  This allows a VBUS compatible entity
(e.g. userspace or a guest) to directly inject and receive packets
from the host/kernel stack.

Thanks to Pat Mullaney for contributing the maxcount modification

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/Makefile                 |    1 
 drivers/vbus/devices/Kconfig     |   17 
 drivers/vbus/devices/Makefile    |    1 
 drivers/vbus/devices/venet-tap.c | 1365 ++++++++++++++++++++++++++++++++++++++
 kernel/vbus/Kconfig              |   13 
 5 files changed, 1397 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vbus/devices/Kconfig
 create mode 100644 drivers/vbus/devices/Makefile
 create mode 100644 drivers/vbus/devices/venet-tap.c

diff --git a/drivers/Makefile b/drivers/Makefile
index c1bf417..98fab51 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -106,3 +106,4 @@ obj-$(CONFIG_SSB)		+= ssb/
 obj-$(CONFIG_VIRTIO)		+= virtio/
 obj-$(CONFIG_STAGING)		+= staging/
 obj-y				+= platform/
+obj-$(CONFIG_VBUS_DEVICES)	+= vbus/devices/
diff --git a/drivers/vbus/devices/Kconfig b/drivers/vbus/devices/Kconfig
new file mode 100644
index 0000000..64e4731
--- /dev/null
+++ b/drivers/vbus/devices/Kconfig
@@ -0,0 +1,17 @@
+#
+# Virtual-Bus (VBus) configuration
+#
+
+config VBUS_VENETTAP
+       tristate "Virtual-Bus Ethernet Tap Device"
+       depends on VBUS_DEVICES
+       default n
+       help
+        Provides a virtual ethernet adapter to a vbus, which in turn
+        manifests itself as a standard netif based adapter to the
+	kernel.  It can be used similarly to a "tuntap" device,
+        except that the char-dev transport is replaced with a vbus/ioq
+        ...
From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:43 am

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/net/vbus-enet.c |  249 +++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/venet.h   |   39 +++++++
 2 files changed, 275 insertions(+), 13 deletions(-)

diff --git a/drivers/net/vbus-enet.c b/drivers/net/vbus-enet.c
index e698b3f..8e96c9c 100644
--- a/drivers/net/vbus-enet.c
+++ b/drivers/net/vbus-enet.c
@@ -42,6 +42,8 @@ static int rx_ringlen = 256;
 module_param(rx_ringlen, int, 0444);
 static int tx_ringlen = 256;
 module_param(tx_ringlen, int, 0444);
+static int sg_enabled = 1;
+module_param(sg_enabled, int, 0444);
 
 #undef PDEBUG             /* undef it, just in case */
 #ifdef VBUS_ENET_DEBUG
@@ -64,8 +66,17 @@ struct vbus_enet_priv {
 	struct vbus_enet_queue     rxq;
 	struct vbus_enet_queue     txq;
 	struct tasklet_struct      txtask;
+	struct {
+		int                sg:1;
+		int                tso:1;
+		int                ufo:1;
+		int                tso6:1;
+		int                ecn:1;
+	} flags;
 };
 
+static void vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force);
+
 static struct vbus_enet_priv *
 napi_to_priv(struct napi_struct *napi)
 {
@@ -199,6 +210,93 @@ rx_teardown(struct vbus_enet_priv *priv)
 	}
 }
 
+static int
+tx_setup(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->txq.queue;
+	struct ioq_iterator iter;
+	int i;
+	int ret;
+
+	if (!priv->flags.sg)
+		/*
+		 * There is nothing to do for a ring that is not using
+		 * scatter-gather
+		 */
+		return 0;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty SG descriptor
+	 */
+	for (i = 0; i < tx_ringlen; i++) {
+		struct venet_sg *vsg;
+		size_t iovlen = sizeof(struct venet_iov) * (MAX_SKB_FRAGS-1);
+		size_t len = sizeof(*vsg) + iovlen;
+
+		vsg = kzalloc(len, GFP_KERNEL);
+		if (!vsg)
+			return ...
From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:43 am

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/net/Kconfig     |   13 +
 drivers/net/Makefile    |    1 
 drivers/net/vbus-enet.c |  706 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 720 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/vbus-enet.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 62d732a..ac9dabd 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -3099,4 +3099,17 @@ config VIRTIO_NET
 	  This is the virtual network driver for virtio.  It can be used with
           lguest or QEMU based VMMs (like KVM or Xen).  Say Y or M.
 
+config VBUS_ENET
+	tristate "Virtual Ethernet Driver"
+	depends on VBUS_DRIVERS
+	help
+	   A virtualized 802.x network device based on the VBUS interface.
+	   It can be used with any hypervisor/kernel that supports the
+	   vbus protocol.
+
+config VBUS_ENET_DEBUG
+        bool "Enable Debugging"
+	depends on VBUS_ENET
+	default n
+
 endif # NETDEVICES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 471baaf..61db928 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -264,6 +264,7 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
 obj-$(CONFIG_NETXEN_NIC) += netxen/
 obj-$(CONFIG_NIU) += niu.o
 obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
+obj-$(CONFIG_VBUS_ENET) += vbus-enet.o
 obj-$(CONFIG_SFC) += sfc/
 
 obj-$(CONFIG_WIMAX) += wimax/
diff --git a/drivers/net/vbus-enet.c b/drivers/net/vbus-enet.c
new file mode 100644
index 0000000..e698b3f
--- /dev/null
+++ b/drivers/net/vbus-enet.c
@@ -0,0 +1,706 @@
+/*
+ * vbus_enet - A virtualized 802.x network device based on the VBUS interface
+ *
+ * Copyright (C) 2009 Novell, Gregory Haskins <ghaskins@novell.com>
+ *
+ * Derived from the SNULL example from the book "Linux Device Drivers" by
+ * Alessandro Rubini, Jonathan Corbet, and Greg Kroah-Hartman, published
+ * by O'Reilly & Associates.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include ...
From: Stephen Hemminger
Date: Tuesday, March 31, 2009 - 1:39 pm

On Tue, 31 Mar 2009 14:43:34 -0400








Please consider adding a basic set of ethtool_ops to allow controlling
offload, etc.
--

From: Gregory Haskins
Date: Thursday, April 2, 2009 - 4:43 am

Thanks for the review, Stephen!

I will apply all of your recommended fixes for the next release.

-Greg

From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:44 am

Later in the series we need a way to detect when a VM is reset, so let's
add a capability for userspace to signal a VM reset down to the kernel.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/kvm/x86.c       |    1 +
 include/linux/kvm.h      |    2 ++
 include/linux/kvm_host.h |    6 ++++++
 virt/kvm/kvm_main.c      |   36 ++++++++++++++++++++++++++++++++++++
 4 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 758b7a1..9b0a649 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -971,6 +971,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_NOP_IO_DELAY:
 	case KVM_CAP_MP_STATE:
 	case KVM_CAP_SYNC_MMU:
+	case KVM_CAP_RESET:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 0424326..7ffd8f5 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -396,6 +396,7 @@ struct kvm_trace_rec {
 #ifdef __KVM_HAVE_USER_NMI
 #define KVM_CAP_USER_NMI 22
 #endif
+#define KVM_CAP_RESET 23
 
 /*
  * ioctls for VM fds
@@ -429,6 +430,7 @@ struct kvm_trace_rec {
 				   struct kvm_assigned_pci_dev)
 #define KVM_ASSIGN_IRQ _IOR(KVMIO, 0x70, \
 			    struct kvm_assigned_irq)
+#define KVM_RESET	          _IO(KVMIO,  0x67)
 
 /*
  * ioctls for vcpu fds
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bf6f703..506eca1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -17,6 +17,7 @@
 #include <linux/preempt.h>
 #include <linux/marker.h>
 #include <linux/msi.h>
+#include <linux/notifier.h>
 #include <asm/signal.h>
 
 #include <linux/kvm.h>
@@ -132,6 +133,8 @@ struct kvm {
 	unsigned long mmu_notifier_seq;
 	long mmu_notifier_count;
 #endif
+
+	struct raw_notifier_head reset_notifier; /* triggers when VM reboots */
 };
 
 /* The guest did something we don't support. */
@@ -158,6 +161,9 @@ void kvm_exit(void);
 void kvm_get_kvm(struct kvm *kvm);
 void ...
From: Avi Kivity
Date: Tuesday, March 31, 2009 - 12:22 pm

How do you handle the case of a guest calling kexec to load a new 
kernel?  Or is that not important for your use case?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 1:02 pm

Hmm..I had not considered this.  Any suggestions on ways to detect it?

From: Avi Kivity
Date: Tuesday, March 31, 2009 - 1:18 pm

Best would be not to detect it; it's tying global events into a device.  
Instead, have a reset command for your device and have the driver issue 
it on load and unload.

btw, reset itself would be better controlled from userspace; qemu knows 
about resets and can reset vbus devices directly instead of relying on 
kvm to reset them.  This decouples the two code bases a bit.  This is 
what virtio does.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 1:37 pm

Yes, good point.  This is doable within the existing infrastructure, but
it would have to be declared in each device's ABI definition.  I could
make it more formal and add it to the list of low-level bus-verbs, like
In a way, this is what I have done (note to self: post the userspace
patches)

The detection is done by userspace, and it invokes an ioctl.  The kernel
based devices then react if they are interested.  In my case, vbus
registers for reset-notification, and it acts as if the guest exited
when it gets reset (e.g. it issues DEVICECLOSE verbs to all devices the
guest had open).
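
The reset flow described above (userspace detects the reset, an ioctl
reaches the kernel, interested listeners react) can be sketched in plain
user-space C.  This is only an illustration under stated assumptions: the
names struct vm, vm_reset_register(), vm_reset() and struct bus_listener
are hypothetical stand-ins and are not taken from the patch, which uses a
raw_notifier_head inside struct kvm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event code; not taken from the patch. */
enum { VM_EVENT_RESET = 1 };

struct notifier {
	int (*call)(struct notifier *nb, unsigned long event, void *data);
	struct notifier *next;
};

struct vm {
	struct notifier *reset_chain;	/* stand-in for kvm->reset_notifier */
};

/* Add a listener to the head of the reset chain. */
void vm_reset_register(struct vm *vm, struct notifier *nb)
{
	assert(vm && nb && nb->call);
	nb->next = vm->reset_chain;
	vm->reset_chain = nb;
}

/* What the ioctl handler would do: walk the chain, return listener count. */
int vm_reset(struct vm *vm)
{
	struct notifier *nb;
	int n = 0;

	assert(vm);
	for (nb = vm->reset_chain; nb; nb = nb->next) {
		nb->call(nb, VM_EVENT_RESET, vm);
		n++;
	}
	return n;
}

/* A vbus-like listener that tracks how many devices the guest has open. */
struct bus_listener {
	struct notifier nb;	/* must stay first: nb is cast back below */
	int open_devices;
};

int bus_on_reset(struct notifier *nb, unsigned long event, void *data)
{
	struct bus_listener *bus = (struct bus_listener *)nb;

	(void)data;
	if (event == VM_EVENT_RESET)
		bus->open_devices = 0;	/* analogous to DEVICECLOSE on each */
	return 0;
}
```

The listener treats a reset exactly like a guest exit, which matches the
behaviour described for vbus; a device that does not care about resets
simply never registers.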


From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:44 am

This patch provides the ability to dynamically declare and map an
interrupt-request handle to an x86 8-bit vector.

Problem Statement: Emulated devices (such as PCI, ISA, etc) have
interrupt routing done via standard PC mechanisms (MP-table, ACPI,
etc).  However, we also want to support a new class of devices
which exist in a new virtualized namespace and therefore should
not try to piggyback on these emulated mechanisms.  Rather, we
create a way to dynamically register interrupt resources that
act independently of their emulated counterparts.

On x86, a simplistic view of the interrupt model is that each core
has a local-APIC which can receive messages from APIC-compliant
routing devices (such as IO-APIC and MSI) regarding details about
an interrupt (such as which vector to raise).  These routing devices
are controlled by the OS so they may translate a physical event
(such as "e1000: raise an RX interrupt") to a logical destination
(such as "inject IDT vector 46 on core 3").  A dynirq is a virtual
implementation of such a router (think of it as a virtual-MSI, but
without the coupling to an existing standard, such as PCI).

The model is simple: A guest OS can allocate the mapping of "IRQ"
handle to "vector/core" in any way it sees fit, and provide this
information to the dynirq module running in the host.  The assigned
IRQ then becomes the sole handle needed to inject an IDT vector
to the guest from a host.  A host entity that wishes to raise an
interrupt simply needs to call kvm_inject_dynirq(irq) and the routing
is performed transparently.
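
As a rough illustration of the model (not the code in arch/x86/kvm/dynirq.c),
the routing can be thought of as a small table indexed by the IRQ handle.
The names dynirq_table, dynirq_assign() and dynirq_inject(), and the table
size, are assumptions made for this sketch; only kvm_inject_dynirq(irq) is
named in the text above.

```c
#include <assert.h>
#include <stdint.h>

#define DYNIRQ_MAX 32		/* hypothetical table size */

struct dynirq_route {
	int     valid;
	uint8_t vector;		/* x86 IDT vector, 8 bits */
	int     cpu;		/* destination core */
};

struct dynirq_table {
	struct dynirq_route routes[DYNIRQ_MAX];
};

/* Guest side: publish "irq -> vector/core".  Returns 0, or -1 on a bad handle. */
int dynirq_assign(struct dynirq_table *t, int irq, uint8_t vector, int cpu)
{
	assert(t);
	if (irq < 0 || irq >= DYNIRQ_MAX)
		return -1;
	t->routes[irq].valid  = 1;
	t->routes[irq].vector = vector;
	t->routes[irq].cpu    = cpu;
	return 0;
}

/*
 * Host side: resolve a handle to its destination.  A real implementation
 * would deliver the vector to the target vcpu; here we only report it.
 */
int dynirq_inject(const struct dynirq_table *t, int irq,
		  uint8_t *vector, int *cpu)
{
	assert(t && vector && cpu);
	if (irq < 0 || irq >= DYNIRQ_MAX || !t->routes[irq].valid)
		return -1;
	*vector = t->routes[irq].vector;
	*cpu    = t->routes[irq].cpu;
	return 0;
}
```

With the example from the text, dynirq_assign(&t, 5, 46, 3) would make
handle 5 resolve to "IDT vector 46 on core 3".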

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/Kconfig                |    5 +
 arch/x86/Makefile               |    3 
 arch/x86/include/asm/kvm_host.h |    9 +
 arch/x86/include/asm/kvm_para.h |   11 +
 arch/x86/kvm/Makefile           |    3 
 arch/x86/kvm/dynirq.c           |  329 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/guest/Makefile     |    2 
 arch/x86/kvm/guest/dynirq.c     |   95 +++++++++++
 ...
From: Avi Kivity
Date: Tuesday, March 31, 2009 - 12:20 pm

A major disadvantage of dynirq is that it will only work on guests which 
have been ported to it.  So this will only be useful on newer Linux, and 
will likely never work with Windows guests.

Why is having an emulated PCI device so bad?  We found that it has 
several advantages:
 - works with all guests
 - supports hotplug/hotunplug, udev, sysfs, module autoloading, ...
 - supported in all OSes
 - someone else maintains it

See also the kvm irq routing work, merged into 2.6.30, which does a 
small part of what you're describing (the "sole handle" part, specifically).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 12:39 pm

These points are all valid, and I really struggled with this particular
part of the design.  The entire vbus design only requires one IRQ for
the entire guest, so it's conceivable that I could present a simple
"dummy" PCI device with some "VBUS" type PCI-ID, just to piggy back on
the IRQ routing logic.  Then userspace could simply pass the IRQ routing
info down to the kernel with an ioctl, or something similar.

Ultimately I wasn't sure whether I wanted all that goo just to get an
IRQ assignment...but on the other hand, we have all this goo to build
one in the first place, and it's half on the guest side, which has the
disadvantages you mention.  So perhaps this should go in favor of a
PCI-esque type solution, as I think you are suggesting.

I think ultimately I was trying to stay away from PCI in general because
I want to support environments that do not have PCI.  However, for the

I will take a look, thanks!

(I wish you had accepted those irq patches I wrote a while back.
It had the foundation for this type of stuff all built in.  But alas, I
think it was before its time, and I didn't do a good job of explaining
my future plans....) ;)

Regards,
-Greg




From: Avi Kivity
Date: Tuesday, March 31, 2009 - 1:13 pm

Won't this have scaling issues?  One IRQ means one target vcpu.  Whereas 
I'd like virtio devices to span multiple queues, each queue with its own 
MSI IRQ.  Also, the single IRQ handler will need to scan for all 
potential IRQ sources.  Even if implemented carefully, this will cause 


s/PCI/the native IRQ solution for your platform/. virtio has the same 
problem; on s390 we use the native (if that word ever applies to s390) 
interrupt and device discovery mechanism.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 1:32 pm

Hmm..you know I hadn't really thought of it that way, but you have a
point.  To clarify, my design actually uses one IRQ per "eventq", where
we can have an arbitrary number of eventq's defined (note: today I only
define one eventq, however).  An eventq is actually a shm-ring construct
where I can pass events up to the host like "device added" or "ring X
signaled".  Each individual device based virtio-ring would then
aggregate "signal" events onto this eventq mechanism to actually inject
events to the host.  Only the eventq itself injects an actual IRQ to the
assigned vcpu.
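
A minimal sketch of the eventq idea, written as ordinary user-space C.
It assumes (this is not stated in the patches shown here) that an
interrupt is raised only when the ring goes from empty to non-empty, so
several device signals can ride on one injection.  The ring size, the
event types and the function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define EVENTQ_LEN 8		/* hypothetical ring size */

enum event_type { EV_DEVADD, EV_SHMSIGNAL };

struct event {
	enum event_type type;
	unsigned long   id;	/* device or ring identifier */
};

struct eventq {
	struct event ring[EVENTQ_LEN];
	size_t   head;		/* next slot to consume */
	size_t   tail;		/* next slot to fill */
	unsigned irqs_raised;	/* counts simulated interrupt injections */
};

/* Producer: queue an event; raise one IRQ only if the ring was empty. */
int eventq_post(struct eventq *q, enum event_type type, unsigned long id)
{
	assert(q);
	if (q->tail - q->head == EVENTQ_LEN)
		return -1;		/* ring full */
	if (q->tail == q->head)
		q->irqs_raised++;	/* empty -> non-empty transition */
	q->ring[q->tail % EVENTQ_LEN] = (struct event){ type, id };
	q->tail++;
	return 0;
}

/* Consumer: the IRQ handler drains what is pending; no device scan needed. */
size_t eventq_drain(struct eventq *q, struct event *out, size_t max)
{
	size_t n = 0;

	assert(q && out);
	while (q->head != q->tail && n < max)
		out[n++] = q->ring[q->head++ % EVENTQ_LEN];
	return n;
}
```

Because each queued event names its source, the consumer learns which
ring was signaled directly from the queue instead of polling every device.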

My intended use of multiple eventqs was for prioritization of different
rings.  For instance, we could define 8 priority levels, each with its
own ring/irq.  That way, a virtio-net that supports something like
802.1p could define 8 virtio-rings, one for each priority level.

But this scheme is more targeted at prioritization than per vcpu
irq-balancing.  I suppose the eventq construct I proposed could still be
used in this fashion since each has its own routable IRQ.  However, I
would have to think about that some more because it is beyond the design
spec.

The good news is that the decision to use the "eventq+irq" approach is
completely contained in the kvm-host+guest.patch.  We could easily
switch to a 1:1 irq:shm-signal if we wanted to, and the device/drivers
Well, no, I think this part is covered.  As mentioned above, we use a
queuing technique so there is no scanning needed.  Ultimately I would
love to adapt a similar technique to optionally replace the LAPIC.  That
way we can avoid the EOI trap and just consume the next interrupt (if

yeah, I agree.  We can contain the "exposure" of PCI to just platforms
within KVM that care about it.

-Greg


From: Avi Kivity
Date: Tuesday, March 31, 2009 - 1:59 pm

You will get cachelines bounced around when events from different
devices are added to the queue.  On the plus side, a single injection 
can contain interrupts for multiple devices.

I'm not sure how useful this coalescing is; certainly you will never see 
it on microbenchmarks, but that doesn't mean it's not useful.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:44 am

This patch adds support for guest access to a VBUS assigned to the same
context as the VM.  It utilizes an IOQ+IRQ to move events from host->guest,
and provides a hypercall interface to move events guest->host.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/include/asm/kvm_para.h |    1 
 arch/x86/kvm/Kconfig            |    9 
 arch/x86/kvm/Makefile           |    3 
 arch/x86/kvm/x86.c              |    6 
 arch/x86/kvm/x86.h              |   12 
 include/linux/kvm.h             |    1 
 include/linux/kvm_host.h        |   20 +
 include/linux/kvm_para.h        |   59 ++
 virt/kvm/kvm_main.c             |    1 
 virt/kvm/vbus.c                 | 1307 +++++++++++++++++++++++++++++++++++++++
 10 files changed, 1419 insertions(+), 0 deletions(-)
 create mode 100644 virt/kvm/vbus.c

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index fba210e..19d81e0 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -14,6 +14,7 @@
 #define KVM_FEATURE_NOP_IO_DELAY	1
 #define KVM_FEATURE_MMU_OP		2
 #define KVM_FEATURE_DYNIRQ		3
+#define KVM_FEATURE_VBUS                4
 
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b81125f..875e96e 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,15 @@ config KVM_TRACE
 	  relayfs.  Note the ABI is not considered stable and will be
 	  modified in future updates.
 
+config KVM_HOST_VBUS
+       bool "KVM virtual-bus (VBUS) host-side support"
+       depends on KVM
+       select VBUS
+       default n
+       ---help---
+          This option enables host-side support for accessing virtual-bus
+	  devices.
+
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
 source drivers/lguest/Kconfig
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index d5676f5..f749ec9 100644
--- ...
From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:43 am

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/vbus/devices/venet-tap.c |  236 +++++++++++++++++++++++++++++++++++++-
 1 files changed, 229 insertions(+), 7 deletions(-)

diff --git a/drivers/vbus/devices/venet-tap.c b/drivers/vbus/devices/venet-tap.c
index ccce58e..0ccb7ed 100644
--- a/drivers/vbus/devices/venet-tap.c
+++ b/drivers/vbus/devices/venet-tap.c
@@ -80,6 +80,13 @@ enum {
 	TX_IOQ_CONGESTED,
 };
 
+struct venettap;
+
+struct venettap_rx_ops {
+	int (*decode)(struct venettap *priv, void *ptr, int len);
+	int (*import)(struct venettap *, struct sk_buff *, void *, int);
+};
+
 struct venettap {
 	spinlock_t                   lock;
 	unsigned char                hmac[ETH_ALEN]; /* host-mac */
@@ -107,6 +114,12 @@ struct venettap {
 		struct vbus_memctx          *ctx;
 		struct venettap_queue        rxq;
 		struct venettap_queue        txq;
+		struct venettap_rx_ops      *rx_ops;
+		struct {
+			struct venet_sg     *desc;
+			size_t               len;
+			int                  enabled:1;
+		} sg;
 		int                          connected:1;
 		int                          opened:1;
 		int                          link:1;
@@ -288,6 +301,183 @@ venettap_change_mtu(struct net_device *dev, int new_mtu)
 }
 
 /*
+ * ---------------------------
+ * Scatter-Gather support
+ * ---------------------------
+ */
+
+/* assumes reference to priv->vbus.conn held */
+static int
+venettap_sg_decode(struct venettap *priv, void *ptr, int len)
+{
+	struct venet_sg *vsg;
+	struct vbus_memctx *ctx;
+	int ret;
+
+	/*
+	 * SG is enabled, so we need to pull in the venet_sg
+	 * header before we can interpret the rest of the
+	 * packet
+	 *
+	 * FIXME: Make sure this is not too big
+	 */
+	if (unlikely(len > priv->vbus.sg.len)) {
+		kfree(priv->vbus.sg.desc);
+		priv->vbus.sg.desc = kzalloc(len, GFP_KERNEL);
+	}
+
+	vsg = priv->vbus.sg.desc;
+	ctx = priv->vbus.ctx;
+
+	ret = ctx->ops->copy_from(ctx, vsg, ptr, ...
From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:43 am

We need to get hotswap events in environments which cannot use existing
facilities (e.g. inotify).  So we add a notifier-chain to allow client
callbacks whenever an interface is {un}registered.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/vbus.h |   15 +++++++++++++
 kernel/vbus/core.c   |   59 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/vbus/vbus.h   |    1 +
 3 files changed, 75 insertions(+), 0 deletions(-)

diff --git a/include/linux/vbus.h b/include/linux/vbus.h
index 5f0566c..04db4ff 100644
--- a/include/linux/vbus.h
+++ b/include/linux/vbus.h
@@ -29,6 +29,7 @@
 #include <linux/sched.h>
 #include <linux/rcupdate.h>
 #include <linux/vbus_device.h>
+#include <linux/notifier.h>
 
 struct vbus;
 struct task_struct;
@@ -137,6 +138,20 @@ static inline void task_vbus_disassociate(struct task_struct *p)
 	}
 }
 
+enum {
+	VBUS_EVENT_DEVADD,
+	VBUS_EVENT_DEVDROP,
+};
+
+struct vbus_event_devadd {
+	const char   *type;
+	unsigned long id;
+};
+
+int vbus_notifier_register(struct vbus *vbus, struct notifier_block *nb);
+int vbus_notifier_unregister(struct vbus *vbus, struct notifier_block *nb);
+
+
 #else /* CONFIG_VBUS */
 
 #define fork_vbus(p) do { } while (0)
diff --git a/kernel/vbus/core.c b/kernel/vbus/core.c
index 033999f..b6df487 100644
--- a/kernel/vbus/core.c
+++ b/kernel/vbus/core.c
@@ -89,6 +89,7 @@ int vbus_device_interface_register(struct vbus_device *dev,
 {
 	int ret;
 	struct vbus_devshell *ds = to_devshell(dev->kobj);
+	struct vbus_event_devadd ev;
 
 	mutex_lock(&vbus->lock);
 
@@ -124,6 +125,14 @@ int vbus_device_interface_register(struct vbus_device *dev,
 	if (ret)
 		goto error;
 
+	ev.type = intf->type;
+	ev.id   = intf->id;
+
+	/* and let any clients know about the new device */
+	ret = raw_notifier_call_chain(&vbus->notifier, VBUS_EVENT_DEVADD, &ev);
+	if (ret < 0)
+		goto error;
+
 	mutex_unlock(&vbus->lock);
 
 	return 0;
@@ -144,6 +153,7 @@ int ...
From: Gregory Haskins
Date: Tuesday, March 31, 2009 - 11:43 am

It will be common to map an IOQ over the VBUS shared-memory interfaces,
so let's generalize their setup so we can reuse the pattern.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/vbus_device.h |    7 +++
 include/linux/vbus_driver.h |    7 +++
 kernel/vbus/Kconfig         |    2 +
 kernel/vbus/Makefile        |    1 
 kernel/vbus/proxy.c         |   64 +++++++++++++++++++++++++++++++
 kernel/vbus/shm-ioq.c       |   89 +++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 170 insertions(+), 0 deletions(-)
 create mode 100644 kernel/vbus/shm-ioq.c

diff --git a/include/linux/vbus_device.h b/include/linux/vbus_device.h
index 705d92e..66990e2 100644
--- a/include/linux/vbus_device.h
+++ b/include/linux/vbus_device.h
@@ -102,6 +102,7 @@
 #include <linux/configfs.h>
 #include <linux/rbtree.h>
 #include <linux/shm_signal.h>
+#include <linux/ioq.h>
 #include <linux/vbus.h>
 #include <asm/atomic.h>
 
@@ -413,4 +414,10 @@ static inline void vbus_connection_put(struct vbus_connection *conn)
 		conn->ops->release(conn);
 }
 
+/*
+ * device-side IOQ helper - dereferences device-shm as an IOQ
+ */
+int vbus_shm_ioq_attach(struct vbus_shm *shm, struct shm_signal *signal,
+			int maxcount, struct ioq **ioq);
+
 #endif /* _LINUX_VBUS_DEVICE_H */
diff --git a/include/linux/vbus_driver.h b/include/linux/vbus_driver.h
index c53e13f..9cfbf60 100644
--- a/include/linux/vbus_driver.h
+++ b/include/linux/vbus_driver.h
@@ -26,6 +26,7 @@
 
 #include <linux/device.h>
 #include <linux/shm_signal.h>
+#include <linux/ioq.h>
 
 struct vbus_device_proxy;
 struct vbus_driver;
@@ -70,4 +71,10 @@ struct vbus_driver {
 int vbus_driver_register(struct vbus_driver *drv);
 void vbus_driver_unregister(struct vbus_driver *drv);
 
+/*
+ * driver-side IOQ helper - allocates device-shm and maps an IOQ on it
+ */
+int vbus_driver_ioq_alloc(struct vbus_device_proxy *dev, int id, int prio,
+			  size_t ringsize, struct ioq **ioq);
+
 #endif /* ...
From: Andi Kleen
Date: Tuesday, March 31, 2009 - 1:18 pm

Gregory Haskins <ghaskins@novell.com> writes:

What might be useful is if you could expand a bit more on what the high-level
use cases for this are.

Questions that come to mind and that would be good to answer:

This seems to be aimed at having multiple VMs talk
to each other, but not talk to the rest of the world, correct? 
Is that a common use case? 

Wouldn't they typically have a default route  anyways and be able to talk to each 
other this way? 
And why can't any such isolation be done with standard firewalling? (it's known that 
current iptables has some scalability issues, but there's work going on right
now to fix that). 

What would be the use cases for non networking devices?

What would the interfaces to the user look like?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Gregory Haskins
Date: Wednesday, April 1, 2009 - 5:03 am

Actually we didn't design specifically for either type of environment.
I think it would, in fact, be well suited to either type of
communication model, even concurrently (e.g. an intra-vm ipc channel
resource could live right on the same bus as a virtio-net and a
vbus itself, and even some of the higher level constructs we apply on
top of it (like venet) are at a different scope than I think what you
are getting at above.  Yes, I suppose you could create a private network
using the existing virtio-net + iptables.  But you could also do the
same using virtio-net and a private bridge device as well.  That is not
what we are trying to address.

What we *are* trying to address is making an easy way to declare virtual
resources directly in the kernel so that they can be accessed more
efficiently.  Contrast that to the way it's done today, where the models
live in, say, qemu userspace.

So instead of having
guest->host->qemu::virtio-net->tap->[iptables|bridge], you simply have
guest->host->[iptables|bridge].  How you make your private network (if
that is what you want to do) is orthogonal...it's the path to get there

I am not sure if you are asking about the guest's perspective or the
host-administrator's perspective.

First, let's look at the low-level device interface from the guest's
perspective.  We can cover the admin perspective in a separate doc, if
need be.

Each device in vbus supports two basic verbs, CALL and SHM:

int (*call)(struct vbus_device_proxy *dev, u32 func,
            void *data, size_t len, int flags);

int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
           void *ptr, size_t len,
           struct shm_signal_desc *sigdesc, struct shm_signal **signal,
           int flags);

CALL provides a synchronous method for invoking some verb on the device
(defined by "func") with some arbitrary data.  The namespace for "func"
is part of the ABI for the device in question.  It is analogous to an
ioctl, with the primary difference being that its ...
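
To make the CALL verb concrete, here is a user-space sketch of a device
that dispatches on "func".  It mirrors the shape of the call() signature
quoted above but uses simplified stand-in types; struct device_proxy,
struct device_ops, TOYNET_FUNC_MACQUERY and toynet_call() are hypothetical
names for illustration, not part of the vbus ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct device_proxy;

struct device_ops {
	int (*call)(struct device_proxy *dev, uint32_t func,
		    void *data, size_t len, int flags);
};

struct device_proxy {
	const struct device_ops *ops;
	unsigned char mac[6];
};

/* Hypothetical "func" namespace; in vbus this is part of each device's ABI. */
enum { TOYNET_FUNC_MACQUERY = 1 };

/* Synchronous verb dispatch, similar in spirit to an ioctl handler. */
int toynet_call(struct device_proxy *dev, uint32_t func,
		void *data, size_t len, int flags)
{
	(void)flags;
	assert(dev);

	switch (func) {
	case TOYNET_FUNC_MACQUERY:
		if (!data || len < sizeof(dev->mac))
			return -1;	/* caller's buffer is too small */
		memcpy(data, dev->mac, sizeof(dev->mac));
		return 0;
	default:
		return -1;		/* unknown verb for this device */
	}
}

const struct device_ops toynet_ops = { toynet_call };
```

A driver would then invoke dev->ops->call(dev, TOYNET_FUNC_MACQUERY, buf,
sizeof(buf), 0) and get its answer before the call returns.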
From: Andi Kleen
Date: Wednesday, April 1, 2009 - 6:23 am

But surely you must have some specific use case in mind? Something
that it does better than the various methods that are available
today. Or rather there must be some problem you're trying


I was wondering about the host-administrators perspective.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Gregory Haskins
Date: Wednesday, April 1, 2009 - 7:19 am

Performance.  We are trying to create a high performance IO infrastructure.

Ideally we would like to see things like virtual-machines have
bare-metal performance (or as close as possible) using just pure
software on commodity hardware.   The data I provided shows that
something like KVM with virtio-net does a good job on throughput even on
10GE, but the latency is several orders of magnitude slower than
bare-metal.   We are addressing this issue and others like it that are a

Ah, ok.  Sorry about that.  It was probably good to document that other
thing anyway, so no harm.

So about the host-administrator interface.  The whole thing is driven by
configfs, and the basics are already covered in the documentation in
patch 2, so I won't repeat it here.  Here is a reference to the file for
everyone's convenience:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=Documentation/vbus.txt;h=e8a05dafaca2899d37bd4314fb0c7529c167ee0f;hb=f43949f7c340bf667e68af6e6a29552e62f59033

So a sufficiently privileged user can instantiate a new bus (e.g.
container) and devices on that bus via configfs operations.  The types
of devices available to instantiate are dictated by whatever vbus-device
modules you have loaded into your particular kernel.  The loaded modules
available are enumerated under /sys/vbus/deviceclass.

Now presumably the administrator knows what a particular module is and
how to configure it before instantiating it.  Once they instantiate it,
it will present an interface in sysfs with a set of attributes.  For
example, an instantiated venet-tap looks like this:

ghaskins@test:~> tree /sys/vbus/devices
/sys/vbus/devices
`-- foo
    |-- class -> ../../deviceclass/venet-tap
    |-- client_mac
    |-- enabled
    |-- host_mac
    |-- ifname
    `-- interfaces
        `-- 0 -> ../../../instances/bar/devices/0


Some of these attributes, like "class" and "interfaces" are default
attributes that are filled in by the infrastructure. ...
From: Gregory Haskins
Date: Wednesday, April 1, 2009 - 7:42 am

Actually, I should also state that I am interested in enabling some new
kinds of features based on having in-kernel devices like this.  For
instance (and this is still very theoretical and half-baked), I would
like to try to support RT guests.

[adding linux-rt-users]

I think one of the things that we need in order to do that is being able
to convey vcpu priority state information to the host in an efficient
way.  I was thinking that a shared-page per vcpu could have something
like "current" and "threshold" priorities.  The guest modifies "current"
while the host modifies "threshold".   The guest would be allowed to
increase its "current" priority without a hypercall (after all, if it's
already running, presumably it is already of sufficient priority as far
as the scheduler is concerned).  But if the guest wants to drop below
"threshold", it needs
to hypercall the host to give it an opportunity to schedule() a new task
(vcpu or not).
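
A user-space sketch of the shared-page protocol just described.  It
assumes that a larger number means a higher priority and that a hypercall
is needed only when the guest lowers "current" to a value below
"threshold"; the struct and function names are hypothetical, and the
hypercall is simulated with a counter.

```c
#include <assert.h>

/* One page like this would be shared between guest and host per vcpu. */
struct vcpu_prio_page {
	int current;	/* written by the guest */
	int threshold;	/* written by the host */
};

struct vcpu {
	struct vcpu_prio_page *page;
	unsigned hypercalls;	/* counts simulated exits to the host */
};

/*
 * Guest-side priority change.  Raising the priority is a plain memory
 * write.  Lowering it below the host's threshold also notifies the host,
 * giving it a chance to schedule() something else.
 * Returns 1 if a hypercall was needed, 0 otherwise.
 */
int vcpu_set_priority(struct vcpu *v, int prio)
{
	int need_hypercall;

	assert(v && v->page);
	need_hypercall = prio < v->page->current &&
			 prio < v->page->threshold;
	v->page->current = prio;
	if (need_hypercall)
		v->hypercalls++;
	return need_hypercall;
}
```

With current=50 and threshold=40, moving to 60 or 45 costs nothing, while
dropping to 30 triggers the one (simulated) hypercall.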

The host, on the other hand, could apply a mapping so that the guest's
priority of RT1-RT99 might map to RT20-RT30 on the host, or something
like that.  We would have to take other considerations into account as
well, such as
implicit boosting on IRQ injection (e.g. the guest could be in HLT/IDLE
when an interrupt is injected...but by virtue of injecting that
interrupt we may need to boost it to (guest-relative) RT50).
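
One way to picture such a mapping is a simple linear scaling of the guest
range onto the host range.  This is only an assumption for illustration;
the text does not say what shape the mapping would take, and
map_guest_prio() is a hypothetical name.

```c
#include <assert.h>

/*
 * Linearly map a guest RT priority in [1, 99] onto the host range
 * [host_lo, host_hi], using integer arithmetic (truncating).
 */
int map_guest_prio(int guest, int host_lo, int host_hi)
{
	assert(guest >= 1 && guest <= 99);
	assert(host_lo <= host_hi);
	return host_lo + (guest - 1) * (host_hi - host_lo) / 98;
}
```

Under this scaling, with the RT20-RT30 window from the example, guest RT1
lands on host RT20, guest RT99 on host RT30, and a boost to
guest-relative RT50 corresponds to host RT25.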

Like I said, this is all half-baked right now.  My primary focus is
improving performance, but I did try to lay the groundwork for taking
things in new directions too..rt being an example.

Hope that helps!
-Greg


From: Andi Kleen
Date: Wednesday, April 1, 2009 - 10:01 am

Ok. So the goal is to bypass user space qemu completely for better
performance. Can you please put this into the initial patch

How would the guest learn of any changes in there?

I think the interesting part would be how e.g. a vnet device

So it would act like a loop device? Would you reuse the loop device
or write something new?

How about VFS mount name spaces?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Anthony Liguori
Date: Wednesday, April 1, 2009 - 11:45 am

FWIW, there's nothing that prevents in-kernel back ends with virtio so 
vbus certainly isn't required for in-kernel backends.

That said, I don't think we're bound today by the fact that we're in 
userspace.  Rather we're bound by the interfaces we have between the 
host kernel and userspace to generate IO.  I'd rather fix those 
interfaces than put more stuff in the kernel.

Regards,

Anthony Liguori
--

From: Gregory Haskins
Date: Wednesday, April 1, 2009 - 2:09 pm

I think there is a slight disconnect here.  This is *exactly* what I am
trying to do.  You can of course do this many ways, and I am not denying
it could be done a different way than the path I have chosen.  One
extreme would be to just slam a virtio-net specific chunk of code
directly into kvm on the host.  Another extreme would be to build a
generic framework into Linux for declaring arbitrary IO types,
integrating it with kvm (as well as other environments such as lguest,
userspace, etc), and building a virtio-net model on top of that.

So in case it is not obvious at this point, I have gone with the latter
approach.  I wanted to make sure it wasn't kvm specific or something
like pci specific so it had the broadest applicability to a range of
environments.  So that is why the design is the way it is.  I understand
that this approach is technically "harder/more-complex" than the "slam
virtio-net into kvm" approach, but I've already done that work.  All we
You will *always* be bound by the fact that you are in userspace.  It's
purely a question of "how much" and "does anyone care".    Right now,
the answer is "a lot (roughly 45x slower)" and "at least Greg's customers
do".  I have no doubt that this can and will change/improve in the
future.  But it will always be true that no matter how much userspace
improves, the kernel based solution will always be faster.  It's simple
physics.  I'm cutting out the middleman to ultimately reach the same
destination as the userspace path, so userspace can never be equal.

I agree that the "does anyone care" part of the equation will approach
zero as the latency difference shrinks across some threshold (probably
the single microsecond range), but I will believe that is even possible
when I see it ;)

Regards,
-Greg

From: Anthony Liguori
Date: Wednesday, April 1, 2009 - 5:29 pm

If it were exactly what you were trying to do, you would have posted a 
virtio-net in-kernel backend implementation instead of a whole new 

Again, let's talk numbers.  A heavy-weight exit is 1us slower than a 
light weight exit.  Ideally, you're taking < 1 exit per packet because 
you're batching notifications.  If your ping latency on bare metal
compared to vbus is 39us to 65us, then all other things being equal,
the cost imposed by doing what you're doing in userspace would make the
latency be 66us taking your latency from 166% of native to 169% of 
native.  That's not a huge difference and I'm sure you'll agree there 
are a lot of opportunities to improve that even further.
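
The percentages above follow from dividing each latency by the 39us
bare-metal figure and truncating.  A small helper (the name
percent_of_native() is made up for this note) reproduces them:

```c
#include <assert.h>

/* Latency as a truncated integer percentage of the bare-metal latency. */
unsigned percent_of_native(unsigned latency_us, unsigned native_us)
{
	assert(native_us != 0);
	return latency_us * 100u / native_us;
}
```

percent_of_native(65, 39) gives 166 and percent_of_native(66, 39) gives
169, matching the "166% of native to 169% of native" comparison.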

And you didn't mention whether your latency tests are based on ping or 
something more sophisticated as ping will be a pathological case that 

Note the other hat we have to wear is not just virtualization developer
but Linux developer.  If there are bad userspace interfaces for IO that 
impose artificial restrictions, then we need to identify those and fix them.

Regards,


--

From: Gregory Haskins
Date: Wednesday, April 1, 2009 - 8:11 pm

semantics, semantics ;)


Ok, so let's see it happen.  Consider the gauntlet thrown :)  Your
challenge, should you choose to accept it, is to take today's 4000us and
hit a 65us latency target while maintaining 10GE line-rate (at least
1500 mtu line-rate).

I personally don't want to even stop at 65.  I want to hit that 36us!
In case you think that is crazy, my first prototype of venet was hitting
about 140us, and I shaved 10us here, 10us there, eventually getting down
to the 65us we have today.  The low hanging fruit is all but harvested
at this point, but I am not done searching for additional sources of
latency. I just needed to take a breather to get the code out there for

Well, the numbers posted were actually from netperf -t UDP_RR.  This
generates a pps from a continuous (but non-bursted) RTT measurement.  So
I invert the pps result of this test to get the average rtt time.  I
have also confirmed that ping jibes with these results (e.g. virtio-net
results were about 4ms, and venet were about 0.065ms as reported by ping).
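
As a sketch of that inversion: for a serialized request/reply test, the
average round trip is the reciprocal of the transaction rate.  The rates
below are the ones from the results table in the original posting; the
function name is illustrative only, and values are truncated to whole
microseconds.

```python
def rtt_us_from_pps(transactions_per_second):
    """Average round-trip time in microseconds for a serialized
    request/reply test, given its transactions per second."""
    return 1_000_000 / transactions_per_second


# Transaction rates from the results table in the original posting.
results_pps = {"bare metal": 25593, "virtio-net": 320, "venet": 15255}

rtt_us = {name: rtt_us_from_pps(pps) for name, pps in results_pps.items()}
for name, value in rtt_us.items():
    print(f"{name}: about {int(value)}us per round trip")
```
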
Ah, but this is not really pathological IMO.  There are plenty of
workloads that exhibit request-reply patterns (e.g. RPC), and this is a
direct measurement of the system's ability to support these
efficiently.   And even unidirectional flows can be hampered by poor
latency (think PTP clock sync, etc).

Massive throughput with poor latency is like Andrew Tanenbaum's
station-wagon full of backup tapes ;)  I think I have proven we can
Well, if we can take anything away from all this: I think I have
demonstrated that you don't need notification batching to get good
throughput.  And batching on the head-end of the queue adds directly to
your latency overhead, so I don't think it's a good technique in general
(though I realize that not everyone cares about latency, per se, so

Fair enough, and I would love to take that on but alas my
development/debug bandwidth is rather finite these days ;)

-Greg


From: Avi Kivity
Date: Wednesday, April 1, 2009 - 11:51 pm

virtio is already non-kvm-specific (lguest uses it) and non-pci-specific 

If you have a good exit mitigation scheme you can cut exits by a factor 
of 100; so the userspace exit costs are cut by the same factor.  If you 
have good copyless networking APIs you can cut the cost of copies to 
zero (well, to the cost of get_user_pages_fast(), but a kernel solution 
needs that too).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 1:52 am

I think Greg's work shows that putting the backend in the kernel
can dramatically reduce the cost of a single guest->host transaction.

Given the choice of having to mitigate or not having the problem
in the first place, guess what I would prefer :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Avi Kivity
Date: Thursday, April 2, 2009 - 2:02 am

Virtio suffers because we've had no notification of when a packet is 
actually submitted.  With the notification, the only difference should 
be in the cost of a kernel->user switch, which is nowhere nearly as 

There is no choice.  Exiting from the guest to the kernel to userspace 
is prohibitively expensive, you can't do that on every packet.

-- 
error compiling committee.c: too many arguments to function

--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 2:16 am

I was referring to the bit between the kernel and userspace.

In any case, I just looked at the virtio mitigation code again
and I am completely baffled at why we need it.  Look at Greg's
code or the netback/netfront notification, why do we need this
completely artificial mitigation when the ring itself provides
a natural way of stemming the flow?

Cheers,
--

From: Avi Kivity
Date: Thursday, April 2, 2009 - 2:27 am

If the vcpu thread does the transmit, then it will always complete 
sending immediately:

  guest: push packet, notify qemu
  qemu: disable notification
  qemu: pop packet
  qemu: copy to tap
  qemu: ??

At this point, qemu must enable notification again, since we have no 
notification from tap that the transmit completed.  The only alternative 
is the timer.

If we do the transmit through an extra thread, then scheduling latency 
buys us some time:

  guest: push packet, notify qemu
  qemu: disable notification
  qemu: schedule iothread
  iothread: pop packet
  iothread: copy to tap
  iothread: check for more packets
  iothread: enable notification

If tap told us when the packets were actually transmitted, life would be 
wonderful:

  guest: push packet, notify qemu
  qemu: disable notification
  qemu: pop packet
  qemu: queue on tap
  qemu: return to guest
  hardware: churn churn churn
  tap: packet is out
  iothread: check for more packets
  iothread: enable notification
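
The first two flows above can be compared with a toy model.  This is
only a sketch under stated assumptions: `Ring`, `run` and `guest_burst`
are hypothetical names, not qemu or virtio code, and "the host gets to
run after every N guest packets" stands in for the scheduling latency of
the iothread.

```python
from collections import deque


class Ring:
    """Toy model of a guest->host TX ring with a notification-enable flag."""

    def __init__(self):
        self.packets = deque()
        self.notify_enabled = True
        self.notifications = 0  # each one stands for a guest exit

    def guest_push(self, packet):
        """Guest queues a packet and kicks the host only if kicks are enabled."""
        self.packets.append(packet)
        if self.notify_enabled:
            self.notifications += 1
            self.notify_enabled = False  # "qemu: disable notification"
            return True
        return False

    def host_drain(self):
        """Host pops everything queued, then re-enables notification."""
        drained = len(self.packets)
        self.packets.clear()
        self.notify_enabled = True
        return drained


def run(num_packets, guest_burst):
    """Return (notifications, packets sent) when the host side runs only
    after every `guest_burst` packets queued by the guest."""
    ring = Ring()
    sent = 0
    for i in range(num_packets):
        ring.guest_push(i)
        if (i + 1) % guest_burst == 0:
            sent += ring.host_drain()
    sent += ring.host_drain()  # flush whatever is left
    return ring.notifications, sent


print(run(100, 1))   # vcpu thread transmits inline: host drains every packet
print(run(100, 10))  # deferred thread: guest gets ten packets ahead per pass
```

With `guest_burst=1` every packet costs a notification; once the guest
can get ahead of the host, the count drops in proportion.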

-- 
error compiling committee.c: too many arguments to function

--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 2:29 am

And why do we need this? Because we are in user space!

I'll continue to wait for your patch and numbers :)

Cheers,
--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 2:33 am

And in case you're working on that patch, this might interest
you.  Check out the netdev thread titled "TX time stamping".
Now that we assign the tap skb with its own sk, these two scenarios
are pretty much identical.

I also noticed that, despite davem's threats to revert the patch, it
has now made Linus's tree :)

Cheers,
--

From: Avi Kivity
Date: Thursday, April 2, 2009 - 2:38 am

Why does a kernel solution not need to know when a packet is transmitted?

-- 
error compiling committee.c: too many arguments to function

--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 2:41 am

Because you can install your own destructor?

I don't know what Greg did, but netback did that nasty page destructor
hack which Jeremy is trying to undo :)

Cheers,
--

From: Avi Kivity
Date: Thursday, April 2, 2009 - 2:43 am

So we're back to "the problem is with the kernel->user interface, not 
userspace being cursed into slowness".


-- 
error compiling committee.c: too many arguments to function

--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 2:44 am

Well until you have a patch + numbers that's only an allegation :)
--

From: Gregory Haskins
Date: Thursday, April 2, 2009 - 4:06 am

You do not need to know when the packet is copied (which I currently
do).  You only need it for zero-copy (which I would like to support,
but as I understand it there are problems with the reliability of the
proper callback, i.e. skb->destructor).

It's "fire and forget" :)

-Greg

From: Avi Kivity
Date: Thursday, April 2, 2009 - 4:59 am

It's more of a "schedule and forget" which I think brings you the win.  
The host disables notifications and schedules the actual tx work (rx 
from the host's perspective).  So now the guest and host continue 
producing and consuming packets in parallel.  So long as the guest is 
faster (due to the host being throttled?), notifications continue to be 
disabled.

If you changed your rx_isr() to process the packets immediately instead 
of scheduling, I think throughput would drop dramatically.

Mark had a similar change for virtio.  Mark?

-- 
error compiling committee.c: too many arguments to function

--

From: Gregory Haskins
Date: Thursday, April 2, 2009 - 5:30 am

Yep, when the "producer::consumer" ratio is > 1, we mitigate signaling.
Right, that is the point. :) This is that "soft asic" thing I was
talking about yesterday.

-Greg


From: Avi Kivity
Date: Thursday, April 2, 2009 - 5:43 am

But all that has nothing to do with where the code lives, in the kernel 
or userspace.

-- 
error compiling committee.c: too many arguments to function

--

From: Gregory Haskins
Date: Thursday, April 2, 2009 - 6:03 am

Agreed, but note I've already stated that some of my boost is likely from
in-kernel, while others are unrelated design elements such as the
"soft-asic" approach (you guys don't read my 10 page emails, do you? ;).
I don't deny that some of my ideas could be used in userspace as well
(Credit if used would be appreciated :).

-Greg


From: Rusty Russell
Date: Thursday, April 2, 2009 - 5:13 am

But if you have a UP guest, there will *never* be another packet in the queue
at this point, since it wasn't running.

As Avi said, you can do the processing in another thread and go back to the
guest; lguest pre-virtio did a hacky "weak" wakeup to ensure the guest ran
again before the thread did for exactly this kind of reason.

While Avi's point about a "powerful enough userspace API" is probably valid,
I don't think it's going to happen.  It's almost certainly less code to put a
virtio_net server in the kernel, than it is to create such a powerful
interface (see vringfd & tap).  And that interface would have one user in
practice.

So, let's roll out a kernel virtio_net server.  Anyone?
Rusty.
--

From: Gregory Haskins
Date: Thursday, April 2, 2009 - 5:50 am

Yep, and I'll be the first to admit that my design only looks forward.
It's for high speed links and multi-core cpus, etc.  If you have a
uniprocessor host, the throughput would likely start to suffer with my
current strategy.  You could probably reclaim some of that throughput
(but trading latency) by doing as you are suggesting with the deferred
initial signalling.  However, it is still a tradeoff to account for the
lower-end rig.  I could certainly put a heuristic/timer on the
guest->host to mitigate this as well, but this is not my target use case
Hmm..well I was hoping to be able to work with you guys to make my
proposal fit this role.  If there is no interest in that, I hope that my
infrastructure itself may still be considered for merging (in *some*
tree, not -kvm per se) as I would prefer to not maintain it out of tree
if it can be avoided.  I think people will find that the new logic
touches very few existing kernel lines at all, and can be completely
disabled with config options so it should be relatively inconsequential
to those that do not care.

-Greg


From: Gregory Haskins
Date: Thursday, April 2, 2009 - 5:52 am

To clarify, I am referring to the internal design of the venet-tap
only.  The general vbus architecture makes no such policy decisions.

-Greg

From: Avi Kivity
Date: Thursday, April 2, 2009 - 6:07 am

The problem is that we already have virtio guest drivers going several 
kernel versions back, as well as Windows drivers.  We can't keep 
changing the infrastructure under people's feet.


-- 
error compiling committee.c: too many arguments to function

--

From: Gregory Haskins
Date: Thursday, April 2, 2009 - 6:22 am

That doesn't make sense to me, tho.  All the testing I did was a UP
guest, actually.  Why would I be constrained to run without the

Well, IIUC the virtio code itself declares the ABI as unstable, so there
technically *is* an out if we really wanted one.  But I certainly
understand the desire to not change this ABI if at all possible, and
thus the resistance here.

However, there's still the possibility we can make this work in an ABI
friendly way with cap-bits, or other such features.  For instance, the
virtio-net driver could register both with pci and vbus-proxy and
instantiate a device with a slightly different ops structure for each or
something.  Alternatively we could write a host-side shim to expose vbus
devices as pci devices or something like that.



From: Avi Kivity
Date: Thursday, April 2, 2009 - 6:27 am

Sounds complicated...

-- 
error compiling committee.c: too many arguments to function

--

From: Gregory Haskins
Date: Thursday, April 2, 2009 - 7:05 am

Well, the first solution would be relatively trivial...at least on the
guest side.  All the other infrastructure is done and included in the
series I sent out.  The changes to the virtio-net driver on the guest
itself would be minimal.  The bigger effort would be converting
venet-tap to use virtio-ring instead of IOQ.  But this would arguably be
less work than starting a virtio-net backend module from scratch because
you would have to not only code up the entire virtio-net backend, but
also all the pci emulation and irq routing stuff that is required (and
is already done by the vbus infrastructure).  Here all the major pieces
are in place, just the xmit and rx routines need to be converted to
virtio-isms.

For the second option, I agree.  It's probably too nasty and it would be
better if there was just either a virtio-net to kvm-host hack, or a more
pci oriented version of a vbus-like framework.

That said, there is certainly nothing wrong with having an alternate
option.  There is plenty of precedent for having different drivers for
different subsystems, etc, even if there is overlap.  Heck, even KVM has
realtek, e1000, and virtio-net, etc.  Would our kvm community be willing
to work with me to get these patches merged?  I am perfectly willing to
maintain them.  That said, the general infrastructure should probably
not live in -kvm (perhaps -tip, -mm, or -next, etc is more
appropriate).  So a good plan might be to shoot for the core going into
a more general upstream tree.  When/if that happens, then the kvm
community could consider the kvm specific parts, etc.  I realize this is
all pending review acceptance by everyone involved...

-Greg



From: Herbert Xu
Date: Thursday, April 2, 2009 - 7:50 am

Going off on a tangent here, I don't really think it should matter
whether we're UP or SMP.  The ideal state is where we have the
same number of (virtual) TX queues as there are cores in the guest.
On the host side we need the backend to run at least on a core
that shares cache with the corresponding guest queue/core.  If
that happens to be the same core as the guest core then it should
work as well.


Yes I agree that changing the guest-side driver is a no-no.  However,
we should be able to achieve what's shown here without modifying the
guest-side.

Cheers,
--

From: Avi Kivity
Date: Thursday, April 2, 2009 - 8:00 am

Good point - if we rely on having excess cores in the host, large guest 
scalability will drop.

-- 
error compiling committee.c: too many arguments to function

--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 8:40 am

Going back to TX mitigation, I wonder if we could avoid it altogether
by having a "wakeup" mechanism that does not involve a vmexit.  We
have two cases:

1) UP, or rather guest runs on the same core/hyperthread as the
backend.  This is the easy one, the guest simply sets a marker
in shared memory and keeps going until its time is up.  Then the
backend takes over, and uses a marker for notification too.

The markers need to be interpreted by the scheduler so that it
knows the guest/backend is runnable, respectively.

2) The guest and backend run on two cores/hyperthreads.  We'll
assume that they share caches as otherwise mitigation is the last
thing to worry about.  We use the same marker mechanism as above.
The only caveat is that if one core/hyperthread is idle, its
idle thread needs to monitor the marker (this would be a separate
per-core marker) to wake up the scheduler.

CCing Ingo so that he can flame me if I'm totally off the mark.

Cheers,
--

From: Avi Kivity
Date: Thursday, April 2, 2009 - 8:57 am

Let's look at this first.

What if the guest sends N packets, then does some expensive computation 
(say the guest scheduler switches from the benchmark process to 
evolution).  So now we have the marker set at packet N, but the host 
will not see it until the guest timeslice is up?

I think I totally misunderstood you.  Can you repeat in smaller words?

-- 
error compiling committee.c: too many arguments to function

--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 9:09 am

Well that's fine.  The guest will use up the remainder of its
timeslice.  After all we only have one core/hyperthread here so
this is no different than if the packets were held up higher up
in the guest kernel and the guest decided to do some computation.

Once its timeslice completes the backend can start plugging away
at the backlog.

Of course it would be better to put the backend on another core
that shares the cache or a hyperthread on the same core.

Cheers,
--

From: Avi Kivity
Date: Thursday, April 2, 2009 - 9:54 am

3ms latency for ping?

(ping will always be scheduled immediately when the reply arrives if I 
understand cfs, so guest load won't delay it)

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 10:06 am

That only happens if the guest immediately does some CPU-intensive
computation for 3ms, and assuming its timeslice lasts that long.

In any case, the same thing will happen right now if the host or
some other guest on the same CPU hogs the CPU for 3ms.

Cheers,
--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 10:17 am

Even better, look at the packet's TOS.  If it's marked for low-
latency then vmexit immediately.  Otherwise continue.

In the backend you'd just set the marker in shared memory.

Of course invert this for the host => guest direction.
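
A minimal sketch of that TOS check, for the guest => host direction.
The 0x10 value is the low-delay bit of the legacy IPv4 TOS byte;
`SharedMarker` and `guest_transmit` are hypothetical names standing in
for the shared-memory marker and the guest transmit path, neither of
which exists in this form anywhere.

```python
IPTOS_LOWDELAY = 0x10  # low-delay bit of the legacy IPv4 TOS byte


class SharedMarker:
    """Hypothetical stand-in for a flag kept in guest/host shared memory."""

    def __init__(self):
        self.work_pending = False


def guest_transmit(tos, marker):
    """Decide how the guest tells the backend about a newly queued packet."""
    if tos & IPTOS_LOWDELAY:
        return "vmexit"         # latency-sensitive: kick the backend now
    marker.work_pending = True  # otherwise leave a marker and keep running
    return "marker"


marker = SharedMarker()
print(guest_transmit(0x10, marker))  # low-delay packet
print(guest_transmit(0x00, marker))  # bulk packet
```
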

Cheers,
--

From: Avi Kivity
Date: Friday, April 3, 2009 - 5:25 am

If the host is overloaded, that's fair.  But millisecond latencies
without host contention are not a good result.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Michael S. Tsirkin
Date: Thursday, April 2, 2009 - 8:10 am

BTW, whatever approach is chosen, to enable zero-copy transmits, it seems that
we still must add tracking of when the skb has actually been transmitted, right?

Rusty, I think this is what you did in your patch from 2008 to add destructor
for skb data ( http://kerneltrap.org/mailarchive/linux-netdev/2008/4/18/1464944 ):
and it seems that it would make zero-copy possible - or was there some problem with
that approach? Do you happen to remember?

-- 
MST
--

From: Jeremy Fitzhardinge
Date: Thursday, April 2, 2009 - 9:43 pm

I'm planning on resurrecting it to replace the page destructor used by 
Xen netback.

    J

--

From: Gregory Haskins
Date: Thursday, April 2, 2009 - 3:55 am

Now you are making my point ;)  This is part of the cost of your
signaling path, and it directly adds to your latency time.   You can't
buffer packets here if the guest is only going to send one and wait for
a response and expect that to perform well.  And this is precisely what
drove me to look at avoiding going back to userspace in the first place.

-Greg

From: Avi Kivity
Date: Thursday, April 2, 2009 - 4:48 am

It adds a microsecond.  The kvm overhead of putting things in userspace 
is low enough, I don't know why people keep mentioning it.  The problem 

We're not buffering any packets.  What we lack is a way to tell the 
guest that we're done processing all packets in the ring (IOW, re-enable 
notifications).

-- 
error compiling committee.c: too many arguments to function

--

From: Gerd Hoffmann
Date: Friday, April 3, 2009 - 3:58 am

I didn't look at virtio-net very closely yet.  I wonder why the
notification is that big an issue though.  It is easy to keep the number
of notifications low without increasing latency:

Check shared ring status when stuffing a request.  If there are requests
not (yet) consumed by the other end there is no need to send a
notification.  That scheme can even span multiple rings (nics with rx
and tx for example).

Host backend can put a limit on the number of requests it takes out of
the queue at once.  i.e. block backend can take out some requests, throw
them at the block layer, check whether any request in flight is done,
if so send back replies, start over again.  guest can put more requests
into the queue meanwhile without having to notify the host.  I've seen
the number of notifications going down to zero when running disk
benchmarks in the guest ;)
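
The scheme described above can be sketched as a toy ring.  `SharedRing`,
`submit` and `take_batch` are illustrative names, not an existing API,
and "the ring was empty" is used here as the stand-in for "the other end
has consumed everything".

```python
from collections import deque


class SharedRing:
    """Toy ring: the producer notifies only when nothing is outstanding."""

    def __init__(self):
        self.pending = deque()
        self.notifications = 0

    def submit(self, request):
        """Queue a request; notify only if the consumer had caught up."""
        needs_notify = not self.pending
        self.pending.append(request)
        if needs_notify:
            self.notifications += 1
        return needs_notify

    def take_batch(self, limit):
        """Backend takes at most `limit` requests out of the ring at once."""
        batch = []
        while self.pending and len(batch) < limit:
            batch.append(self.pending.popleft())
        return batch


ring = SharedRing()
kicks = [ring.submit(i) for i in range(4)]  # only the first one notifies
first = ring.take_batch(2)                  # backend works on two requests
late = ring.submit(4)                       # ring not empty: no notification
rest = ring.take_batch(10)
print(kicks, first, late, rest, ring.notifications)
```

If the backend empties the ring before the next request arrives, every
`submit` notifies again, so the saving depends on the producer staying
ahead of the consumer.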

Of course that works best with one or more I/O threads, so the vcpu
doesn't have to stop running anyway to get the I/O work done ...

cheers,
  Gerd
--

From: Avi Kivity
Date: Friday, April 3, 2009 - 4:03 am

If the host is able to consume a request immediately, and the guest is 
not able to batch requests, this breaks down.  And that is the current 
situation.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Herbert Xu
Date: Friday, April 3, 2009 - 4:12 am

Hang on, why is the host consuming the request immediately? It
has to write the packet to tap, which then calls netif_rx_ni so
it should actually go all the way, no?

Cheers,
--

From: Avi Kivity
Date: Friday, April 3, 2009 - 4:46 am

The host writes the packet to tap, at which point it is consumed from 
its point of view.  The host would like to mention that if there was an 
API to notify it when the packet was actually consumed, then it would 
gladly use it.  Bonus points if this involves not copying the packet.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Herbert Xu
Date: Friday, April 3, 2009 - 4:48 am

We're using write(2) for this, no? That should invoke netif_rx_ni
which blocks until the packet is "processed", which usually means
that it's placed on the NIC's hardware queue.

Cheers,
--

From: Avi Kivity
Date: Friday, April 3, 2009 - 4:54 am

It doesn't copy and queue the packet?  We use O_NONBLOCK and poll() so 
we can tell when we can queue without blocking.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Herbert Xu
Date: Friday, April 3, 2009 - 4:55 am

Well netif_rx queues the packet, but netif_rx_ni is netif_rx plus
an immediate flush.

Cheers,
--

From: Avi Kivity
Date: Friday, April 3, 2009 - 5:02 am

But it flushes the tap device, the packet still has to go through the 
bridge + real interface?

Even if it's queued there, I want to know when the packet is on the 
wire, not on some random software or hardware queue in the middle.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Herbert Xu
Date: Friday, April 3, 2009 - 6:05 am

Which under normal circumstances should occur before netif_rx_ni
returns.

Cheers,
--

From: Andi Kleen
Date: Friday, April 3, 2009 - 4:18 am

> Check shared ring status when stuffing a request.  If there are requests

That means you're bouncing cache lines all the time. Probably not a big
issue on single socket but could be on larger systems.

-Andi

--

From: Herbert Xu
Date: Friday, April 3, 2009 - 4:34 am

If the backend is running on a core that doesn't share caches
with the guest queue then you've got bigger problems.

Right this is unavoidable for guests with many CPUs but that
should go away once we support multiqueue in virtio-net.

Cheers,
--

From: Avi Kivity
Date: Friday, April 3, 2009 - 4:46 am

That's why I'd like requests to be handled on the vcpu thread rather 
than an auxiliary thread.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Gregory Haskins
Date: Friday, April 3, 2009 - 4:28 am

FWIW: I employ this scheme.  The shm-signal construct has a "dirty" and
"pending" flag (all on the same cacheline, which may or may not address
Andi's later point).  The first time you dirty the shm, it sets both
flags.  The consumer side has to clear "pending" before any subsequent
signals are sent.  Normally the consumer side will also clear "enabled"
(as part of the bidir napi thing) to further disable signals.
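
A rough model of the flag handling described above.  This is not the
shm_signal code from the series: the class and method names are
invented, and the rule that a disabled signal still records "dirty" but
sends nothing is an assumption made for the sketch.

```python
class ShmSignalModel:
    """Toy model of the dirty/pending/enabled flags on one shared signal."""

    def __init__(self):
        self.enabled = True
        self.dirty = False
        self.pending = False
        self.signals_sent = 0

    def producer_dirty(self):
        """Producer touched the shared memory; signal only if allowed."""
        self.dirty = True
        if self.enabled and not self.pending:
            self.pending = True  # first dirtying sets both flags
            self.signals_sent += 1
            return True
        return False

    def consumer_ack(self):
        """Consumer handled the signal; further signals may now be sent."""
        self.pending = False
        self.dirty = False

    def consumer_disable(self):
        """Consumer is polling (napi style) and wants no signals for now."""
        self.enabled = False

    def consumer_enable(self):
        """Re-enable signals; returns True if work arrived while disabled."""
        self.enabled = True
        return self.dirty


sig = ShmSignalModel()
print(sig.producer_dirty())  # first dirtying sends a signal
print(sig.producer_dirty())  # still pending: no second signal
```
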

-Greg



From: Gregory Haskins
Date: Thursday, April 2, 2009 - 3:46 am

Ok, then to be more specific, I need it to be more generic than it
already is.  For instance, I need it to be able to integrate with
shm_signals.  If we can do that without breaking the existing ABI, that
would be great!  Last I looked, it was somewhat entwined here so I didn't
try...but I admit that I didn't try that hard since I already had the IOQ

"exit mitigation" schemes are for bandwidth, not latency.  For latency
it all comes down to how fast you can signal in both directions.  If
someone is going to do a stand-alone request-reply, it's generally always
going to be at least one hypercall and one rx-interrupt.  So your speed
will be governed by your signal path, not your buffer bandwidth.

What I've done is shown that you can use techniques other than buffering
the head of the queue to do exit mitigation for bandwidth, while still
maintaining a very short signaling path for latency.  And I also argue
that the latter will always be optimal in the kernel, though I know the
degree is still TBD.  Anthony thinks he can make the difference
negligible, and I would love to see it but am skeptical.

-Greg



From: Avi Kivity
Date: Thursday, April 2, 2009 - 4:43 am

The userspace path is longer by 2 microseconds (for two additional 
heavyweight exits) and a few syscalls.  I don't think that's worthy of 
putting all the code in the kernel.

-- 
error compiling committee.c: too many arguments to function

--

From: Gregory Haskins
Date: Thursday, April 2, 2009 - 5:22 am

Well, shm_signals is what I designed to be the event mechanism for vbus
devices.  One of the design criteria of shm_signal is that it should
support a variety of environments, such as kvm, but also something like
userspace apps.  So I cannot make assumptions about things like "pci
interrupts", etc.

So if I want to use it in vbus, virtio-ring has to be able to use them,
as opposed to what it does today. Part of this would be a natural fit
for the "kick()" callback in virtio, but there are other problems.  For
one, virtio-ring (IIUC) does its own event-masking directly in the
virtio metadata.  However, really I want the higher layer ring-overlay
to do its masking in terms of the lower-layered shm_signal in order to
work the way I envision this stuff.  If you look at the IOQ
implementation, this is exactly what it does.

To be clear, and I've stated this in the past: venet is just an example
of this generic, in-kernel concept.  We plan on doing much much more
with all this.  One of the things we are working on is having userspace
clients be able to access this too, with an ultimate goal of
supporting things like having guest-userspace doing bypass, rdma, etc.
We are not there yet, though...only the kvm-host to guest kernel is
currently functional and is thus the working example.

I totally "get" the attraction to doing things in userspace.  It's
contained, naturally isolated, easily supports migration, etc.  It's also
a penalty.  Bare-metal userspace apps have a direct path to the kernel
IO.  I want to give guests the same advantage.  Some people will care
more about things like migration than performance, and that is fine.
But others will certainly care more about performance, and that is what

By your own words, the exit to userspace is "prohibitively expensive",
so that is either true or it's not.  If it's 2 microseconds, show me.  We
need the rtt time to go from a "kick" PIO all the way to queue a packet
on the egress hardware and return.  That is going to define ...

From: Avi Kivity
Date: Thursday, April 2, 2009 - 5:42 am

virtio doesn't make these assumptions either.  The only difference I see 

In user/test/x86/vmexit.c, change 'cpuid' to 'out %al, $0'; drop the 
printf() in kvmctl.c's test_outb().

I get something closer to 4 microseconds, but that's on a two year old
machine; it will be around two on Nehalems.

My 'prohibitively expensive' is true only if you exit every packet.



-- 
error compiling committee.c: too many arguments to function

--

From: Gregory Haskins
Date: Thursday, April 2, 2009 - 5:54 am

Understood, but yet you need to do this if you want something like iSCSI
READ transactions to have as low a latency as possible.

-Greg


From: Avi Kivity
Date: Thursday, April 2, 2009 - 6:08 am

Dunno, two microseconds is too much?  The wire imposes much more.

-- 
error compiling committee.c: too many arguments to function

--

From: Gregory Haskins
Date: Thursday, April 2, 2009 - 6:36 am

No, but that's not what we are talking about.  You said signaling on
every packet is prohibitively expensive.  I am saying signaling on every
packet is required for decent latency.  So is it prohibitively expensive
or not?

I think most would agree that adding 2us is not bad, but so far that is
an unproven theory that the IO path in question only adds 2us.   And we
are not just looking at the rate at which we can enter and exit the
guest...we need the whole path...from the PIO kick to the dev_xmit() on
the egress hardware, to the ingress and rx-injection.  This includes any
and all penalties associated with the path, even if they are imposed by
something like the design of tun-tap.

Right now it's way way way worse than 2us.  In fact, at my last reading
this was more like 3060us (3125-65).  So shorten that 3125 to 67 (while
maintaining line-rate) and I will be impressed.  Heck, shorten it to
80us and I will be impressed.
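
As a sanity check on these numbers, the round-trip times follow from the request/response rates in the opening post if each transaction is strictly serial, so that rtt = 1/rate. A small Python sketch (the function and variable names are made up for illustration; the rates are the ones reported in the opening post):

```python
def rtt_us(transactions_per_second):
    """Round-trip time in microseconds for a strictly serial request/response test."""
    return 1_000_000 / transactions_per_second

# Rates taken from the benchmark table in the opening post of this thread.
results = {"bare metal": 25593, "virtio-net": 320, "venet": 15255}

for name, pps in results.items():
    print(f"{name:11s} {rtt_us(pps):8.1f} us")

# Latency gap between virtio-net and venet, in microseconds.
gap_us = rtt_us(results["virtio-net"]) - rtt_us(results["venet"])
print(f"virtio-net minus venet: {gap_us:.0f} us")
```

The computed values agree with the quoted 39us, 3125us and 65us to within a microsecond, and the virtio-net to venet gap comes out at roughly the 3060us mentioned above.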

-Greg

From: Avi Kivity
Date: Thursday, April 2, 2009 - 6:45 am

We're heading dangerously into the word-game area.  Let's not do that.

If you have a high throughput workload with many packets per second 
then an exit per packet (whether to userspace or to the kernel) is 
expensive.  So you do exit mitigation.  Latency is not important since 
the packets are going to sit in the output queue anyway.

If you have a request-response workload with the wire idle and latency 
critical, then there's no problem having an exit per packet because (a) 
there aren't that many packets and (b) the guest isn't doing any 
batching, so guest overhead will swamp the hypervisor overhead.

If you have a low latency request-response workload mixed with a high 
throughput workload, then you aren't going to get low latency since your 
low latency packets will sit on the queue behind the high throughput 
packets.  You can fix that with multiqueue and then you're back to one 

Correct, we need to look at the whole path.  That's why the wishing well 

The 3060us thing is a timer, not cpu time.  We aren't starting a JVM for 
each packet.  We could remove it given a notification API, or 
duplicating the sched-and-forget thing, like Rusty did with lguest or 
Mark with qemu.

-- 
error compiling committee.c: too many arguments to function

--

From: Gregory Haskins
Date: Thursday, April 2, 2009 - 7:24 am

Agreed.  virtio-net currently does this with batching.  I do with the
bidir napi thing (which effectively crosses the producer::consumer > 1

Right, so the trick is to use an algorithm that adapts here.  Batching
solves the first case, but not the second.  The bidir napi thing solves
both, but it does assume you have ample host processing power to run the
algorithm concurrently.  This may or may not be suitable to all

Agreed, and that's ok.  Now we are getting more into 802.1p type MQ

Agreed, but it's still "state of the art" from an observer perspective.
The reason "why", though easily explainable, is inconsequential to most
people.  FWIW, I have seen virtio-net do a much more respectable 350us

Heh...it kind of feels like that right now, so hopefully some
improvement will at least be the one thing that comes out of all this.


-Greg

From: Avi Kivity
Date: Thursday, April 2, 2009 - 7:32 am

The alternative is to get a notification from the stack that the packet 
is done processing.  Either an skb destructor in the kernel, or my new 

All I want is the notification, and the timer is headed into the nearest 
landfill.


-- 
error compiling committee.c: too many arguments to function

--

From: Avi Kivity
Date: Thursday, April 2, 2009 - 7:41 am

btw, my new api is


   io_submit(..., nr, ...): submit nr packets
   io_getevents(): complete nr packets

-- 
error compiling committee.c: too many arguments to function

--

From: Anthony Liguori
Date: Thursday, April 2, 2009 - 7:49 am

I don't think we even need that to end this debate.  I'm convinced we 
have a bug somewhere.  Even disabling TX mitigation, I see a ping 
latency of around 300ns whereas it's only 50ns on the host.  This defies 
logic so I'm now looking to isolate why that is.

Regards,

Anthony Liguori

--

From: Anthony Liguori
Date: Thursday, April 2, 2009 - 9:09 am

I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes were 
the big winner... I hate qemu sometimes.

I'm pretty confident I can get at least to Greg's numbers with some 
poking.  I think I understand why he's doing better after reading his 
patches carefully but I also don't think it'll scale with many guests 
well...  stay tuned.

But most importantly, we are darn near where vbus is with this patch wrt 
added packet latency and this is totally from userspace with no host 
kernel changes.

So no, userspace is not the issue.

Regards,


From: Avi Kivity
Date: Thursday, April 2, 2009 - 9:19 am

The way I read it, it will only run slowly once per page, then 
settle to a cache miss per page.

Regardless, it makes a memslot model even more attractive.


-- 
error compiling committee.c: too many arguments to function

--

From: Anthony Liguori
Date: Thursday, April 2, 2009 - 11:18 am

UDP_RR test was limited by CPU consumption.  QEMU was pegging a CPU with 
only about 4000 packets per second whereas the host could do 14000.  An 
oprofile run showed that phys_page_find/cpu_physical_memory_rw were at 
the top by a wide margin which makes little sense since virtio is zero 
copy in kvm-userspace today.

That leaves the ring queue accessors that used ld[wlq]_phys and friends 
that happen to make use of the above.  That led me to try this terrible 
hack below and lo and behold, we immediately jumped to 10000 pps.  This 
only works because almost nothing uses ld[wlq]_phys in practice except 
for virtio so breaking it for the non-RAM case didn't matter.

We didn't encounter this before because when I changed this behavior, I 
tested streaming and ping.  Both remained the same.  You can only expose 
this issue if you first disable tx mitigation.

Anyway, if we're able to send this many packets, I suspect we'll be able 
to also handle much higher throughputs without TX mitigation so that's 
what I'm going to look at now.

Regards,

Anthony Liguori
--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 6:11 pm

Awesome! I'm prepared to eat my words :)

On the subject of TX mitigation, can we please set a standard
on how we measure it? For instance, do we bind the backend
qemu to the same CPU as the guest, or do we bind it to a different
CPU that shares cache? They're two completely different scenarios
and I think we should be explicit about which one we're measuring.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Alex Williamson
Date: Monday, April 20, 2009 - 11:02 am

Anthony,

Any news on this?  I'm anxious to see virtio-net performance on par with
the virtual-bus results.  Thanks,

Alex

--

From: Gregory Haskins
Date: Friday, April 3, 2009 - 5:03 am

[ I've already said this privately to Anthony on IRC, but ..]

Hey, congrats!  That's impressive actually.

So I realize that perhaps you guys are not quite seeing my long term
vision here, which I think will offer some new features that we don't
have today.  I hope to change that over the coming weeks.  However, I
should also point out that perhaps even if, as of right now, my one and
only working module (venet-tap) were all I could offer, it does give us
a "rivalry" position between the two, and this historically has been a
good thing on many projects.  This helps foster innovation through
competition that potentially benefits both.  Case in point, a little
competition provoked an investigation that brought virtio-net's latency
down from 3125us to 90us.  I realize it's not a production-ready patch
quite yet, but I am confident Anthony will find something that is
suitable to check in very soon.  That's a huge improvement to a problem
that was just sitting around unnoticed because there was nothing to
compare it with.

So again, I am asking that my work be considered for acceptance (either
in its current form, or something we agree on after the normal review
process) not only on the basis of the future development of the
platform, but also to keep the current components running to their
full potential.  I will again point out that the code is almost
completely off to the side, can be completely disabled with config
options, and I will maintain it.  Therefore the only real impact is to
people who care to even try it, and to me.

-Greg

From: Avi Kivity
Date: Friday, April 3, 2009 - 5:15 am

Your work is a whole stack.  Let's look at the constituents.

- a new virtual bus for enumerating devices.

Sorry, I still don't see the point.  It will just make writing drivers 
more difficult.  The only advantage I've heard from you is that it gets 
rid of the gunk.  Well, we still have to support the gunk for non-pv 
devices so the gunk is basically free.  The clean version is expensive 
since we need to port it to all guests and implement exciting features 
like hotplug.

- finer-grained point-to-point communication abstractions

Where virtio has ring+signalling together, you layer the two.  For 
networking, it doesn't matter.  For other applications, it may be 
helpful, perhaps you have something in mind.

- your "bidirectional napi" model for the network device

virtio implements exactly the same thing, except for the case of tx 
mitigation, due to my (perhaps pig-headed) rejection of doing things in 
a separate thread, and due to the total lack of sane APIs for packet 
traffic.

- a kernel implementation of the host networking device

Given the continuous rejection (or rather, their continuous 
non-adoption-and-implementation) of my ideas re zerocopy networking aio, 
that seems like a pragmatic approach.  I wish it were otherwise.

- a promise of more wonderful things yet to come

Obviously I can't evaluate this.

Did I miss anything?

Right now my preferred course of action is to implement a prototype 
userspace notification for networking.  Second choice is to move the 
host virtio implementation into the kernel.  I simply don't see how the 
rest of the stack is cost effective.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Gregory Haskins
Date: Friday, April 3, 2009 - 6:13 am

My real objection to PCI is fast-path related.  I don't object, per se,
to using PCI for discovery and hotplug.  If you use PCI just for these
types of things, but then allow fastpath to use more hypercall oriented
primitives, then I would agree with you.  We can leave PCI emulation in
user-space, and we get it for free, and things are relatively tidy.

It's once you start requiring that we stay ABI compatible with something
like the existing virtio-net in x86 KVM where I think it starts to get
ugly when you try to move it into the kernel.  So that is what I had a
real objection to.  I think as long as we are not talking about trying
to make something like that work, it's a much more viable prospect.

So what I propose is the following:

1) The core vbus design stays the same (or close to it)
2) the vbus-proxy and kvm-guest patch go away
3) the kvm-host patch changes to work with coordination from the
userspace-pci emulation for things like MSI routing
4) qemu will know to create some MSI shim 1:1 with whatever it
instantiates on the bus (and can communicate changes
5) any drivers that are written for these new PCI-IDs that might be
present are allowed to use a hypercall ABI to talk after they have been
probed for that ID (e.g. they are not limited to PIO or MMIO BAR type
access methods).

Once I get here, I might have greater clarity to see how hard it would
be to emulate fast path components as well.  It might be easier than I
think.

This is all off the cuff so it might need some fine tuning before it's
actually workable.


Yeah, actually.  Thanks for bringing that up.

So the reason why signaling and the ring are distinct constructs in the
design is to facilitate constructs other than rings.  For instance,
there may be some models where having a flat shared page is better than
a ring.  A ring will naturally preserve all values in flight, whereas a
flat shared page would not (last update is always current).  There are
some algorithms where a previously posted ...
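
The distinction between the two constructs can be illustrated with a toy model. This is only a sketch in Python with made-up class names (Ring, SharedPage); it is not the vbus shm code, and it ignores signaling and concurrency entirely:

```python
from collections import deque


class Ring:
    """Bounded FIFO: every posted value stays in flight until it is consumed."""

    def __init__(self, size):
        self.size = size
        self.slots = deque()

    def post(self, value):
        if len(self.slots) >= self.size:
            raise BufferError("ring full")
        self.slots.append(value)

    def consume(self):
        return self.slots.popleft() if self.slots else None


class SharedPage:
    """Single slot: a new post overwrites the old one, so only the latest is visible."""

    def __init__(self):
        self.value = None

    def post(self, value):
        self.value = value

    def consume(self):
        return self.value


ring, page = Ring(size=8), SharedPage()
for v in (1, 2, 3):
    ring.post(v)
    page.post(v)

ring_seen = [ring.consume() for _ in range(3)]  # all three updates, in order
page_seen = page.consume()                      # only the most recent update
```

The ring hands the consumer every update in order, while the flat page only ever exposes the last one, which is the property described above.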
From: Avi Kivity
Date: Friday, April 3, 2009 - 6:37 am

I don't see why the fast path of virtio-net would be bad.  Can you 
elaborate?


Sorry, I still don't see what advantage this has over PCI, and how you 


The way we'd do it with virtio is to add a feature bit that says "you can 
hypercall here instead of pio".  This way old drivers continue to work.
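
A feature bit of this kind could be negotiated roughly as follows. This is a hedged sketch: the bit position and the names F_HYPERCALL_KICK, negotiate and kick_method are invented for the example and are not taken from the virtio specification.

```python
# Hypothetical bit position, chosen only for this sketch.
F_HYPERCALL_KICK = 1 << 5


def negotiate(host_features, driver_features):
    """Only features advertised by both sides end up being used."""
    return host_features & driver_features


def kick_method(negotiated):
    """Pick the notification mechanism from the negotiated feature set."""
    return "hypercall" if negotiated & F_HYPERCALL_KICK else "pio"


host = F_HYPERCALL_KICK | 0b11        # host offers the new bit plus two older features
old_driver = 0b11                     # driver that predates the new bit
new_driver = F_HYPERCALL_KICK | 0b11  # driver that understands the new bit

old_choice = kick_method(negotiate(host, old_driver))
new_choice = kick_method(negotiate(host, new_driver))
```

An old driver never acknowledges the bit and keeps using pio, which is why existing guests would continue to work unchanged.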

Note that nothing prevents us from trapping pio in the kernel (in fact, 
we do) and forwarding it to the device.  It shouldn't be any slower than 

The vbus part (I assume you mean device enumeration) worries me.  I 
don't think you've yet set down what its advantages are.  Being pure and 
clean doesn't count, unless you rip out PCI from all existing installed 


You keep falling into the paravirtualize the entire universe trap.  If 
you look deep down, you can see Jeremy struggling in there trying to 
bring dom0 support to Linux/Xen.

The lapic is a huge ball of gunk but ripping it out is a monumental job 
with no substantial benefits.  We can at much lower effort avoid the EOI 
trap by paravirtualizing that small bit of ugliness.  Sure the result 
isn't a pure and clean room implementation.  It's a band aid.  But I'll 
take a 50-line band aid over a 3000-line implementation split across 
guest and host, which only works with Linux.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Gregory Haskins
Date: Friday, April 3, 2009 - 9:28 am

At the very least, PIOs are slightly slower than hypercalls.  Perhaps
not enough to care, but the last time I measured them they were slower,
and therefore my clean slate design doesn't use them.

But I digress.  I think I was actually kind of agreeing with you that we


I think you are confusing the vbus-proxy (guest side) with the vbus
backend.  (1) is saying "keep the vbus backend" and (2) is saying drop
the guest side stuff.  In this proposal, the guest would speak a PCI ABI
as far as it's concerned.  Devices in the vbus backend would render as

Well, if the device model was an object in vbus down in the kernel, yet
PCI emulation was up in qemu, presumably we would want something to
handle things like PCI config-cycles up in userspace.  Like, for
instance, if the guest re-routes the MSI.  The shim/proxy would handle
the config-cycle, and then turn around and do an ioctl to the kernel to
configure the change with the in-kernel device model (or the irq
infrastructure, as required).

But, TBH, I haven't really looked into what's actually required to make

Yep, agreed.  This is what I was thinking we could do.  But now that I
have the possibility that I just need to write a virtio-vbus module to

Sure, it's just slightly slower, so I would prefer pure hypercalls if at

No, you are confusing the front-end and back-end again ;)

The back-end remains, and holds the device models as before.  This is
the "vbus core".  Today the front-end interacts with the hypervisor to
render "vbus" specific devices.  The proposal is to eliminate the
front-end, and have the back end render the objects on the bus as PCI
devices to the guest.  I am not sure if I can make it work, yet.  It

You are being overly dramatic.  No one has ever said we are talking
about ripping something out.  In fact, I've explicitly stated that PCI
can coexist peacefully.    Having more than one bus in a system is
certainly not without precedent (PCI, scsi, usb, etc).

Rather, PCI is PCI, and will always be.  PCI was ...
From: Avi Kivity
Date: Sunday, April 5, 2009 - 3:00 am

One thing I thought of to make this generic is to use file 
descriptors as irq handles.  So:

- userspace exposes a PCI device (same as today)
- guest configures its PCI IRQ (using MSI if it supports it)
- userspace handles this by calling KVM_IRQ_FD which converts the irq to 
a file descriptor
- userspace passes this fd to the kernel, or another userspace process
- end user triggers guest irqs by writing to this fd

We could do the same with hypercalls:

- guest and host userspace negotiate hypercall use through PCI config space
- userspace passes an fd to the kernel
- whenever the guest issues a hypercall, the kernel writes the 
arguments to the fd
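
To illustrate the fd-as-irq-handle idea without any KVM specifics, here is a small Python sketch. An ordinary pipe stands in for the descriptor that KVM_IRQ_FD would return, and trigger_irq/pending_irqs are hypothetical helper names; the real interface may look quite different. It assumes a POSIX system where select() works on pipes.

```python
import os
import select

# Stand-in for the fd that userspace would obtain for a guest irq. A plain
# pipe is used only so the sketch runs anywhere; it is not the KVM interface.
irq_read_fd, irq_write_fd = os.pipe()


def trigger_irq(fd):
    """What the 'end user' (kernel code or another process) would do: write to the fd."""
    os.write(fd, b"\x01")


def pending_irqs(fd):
    """What the irq consumer would do: drain the fd and count the signals."""
    count = 0
    while select.select([fd], [], [], 0)[0]:
        count += len(os.read(fd, 64))
    return count


trigger_irq(irq_write_fd)
trigger_irq(irq_write_fd)
delivered = pending_irqs(irq_read_fd)
```

The point of the design is that whoever holds the write end needs no knowledge of PCI, MSI routing, or the guest; it just writes to a descriptor.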

It seems to me this already exists, it's the qemu device model.

The host kernel doesn't need any knowledge of how the devices are 





Given that we have PCI, why would we do an alternative?

It works, it works with Windows, the nasty stuff is in userspace.  Why 

The kernel need know nothing about PCI, so I don't see how you work this 

You've stated it, but failed to provide arguments for it.


-- 
error compiling committee.c: too many arguments to function

--

From: Chris Wright
Date: Wednesday, April 1, 2009 - 1:40 pm

And more stuff in the kernel can come at the potential cost of weakening
protection/isolation.
--

From: Herbert Xu
Date: Wednesday, April 1, 2009 - 8:11 pm

Protection/isolation always comes at a cost.  Not everyone wants
to pay that, just like health insurance :) We should enable the
users to choose which model they want, based on their needs.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Wednesday, April 1, 2009 - 8:09 pm

I'm sorry but I totally disagree with that.  By having our IO
infrastructure in user-space we've basically given up the main
advantage of kvm, which is that the physical drivers operate in
the same environment as the hypervisor.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Avi Kivity
Date: Wednesday, April 1, 2009 - 11:46 pm

I don't understand this.  If we had good interfaces, all that userspace 
would do is translate guest physical addresses to host physical 
addresses, and translate the guest->host protocol to host API calls.  I 
don't see anything there that benefits from being in the kernel.

Can you elaborate?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 1:54 am

I think Greg has expressed it clearly enough.

At the end of the day, the numbers speak for themselves.  So if
and when there's a user-space version that achieves the same or
better results, then I will change my mind :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Avi Kivity
Date: Thursday, April 2, 2009 - 2:03 am

Like Anthony said, the problem is with the kernel->user interfaces.  We 
won't have a good user space virtio implementation until that is fixed.

-- 
error compiling committee.c: too many arguments to function

--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 2:05 am

If it's just the interface that's bad, then it should be possible
to do a proof-of-concept patch to show that this is the case.

Even if we have to redesign the interface, at least you can then
say that you guys were right all along :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Gregory Haskins
Date: Wednesday, April 1, 2009 - 1:29 pm

The only events explicitly supported by the infrastructure of this
nature would be device-add and device-remove.  So when an admin adds or
removes a device to a bus, the guest would see driver::probe() and
driver::remove() callbacks, respectively.  All other events are left (by
design) to be handled by the device ABI itself, presumably over the
provided shm infrastructure.

So for instance, I have on my todo list to add a third shm-ring for
events in the venet ABI.   One of the event-types I would like to
support is LINK_UP and LINK_DOWN.  These events would be coupled to the
administrative manipulation of the "enabled" attribute in sysfs.  Other
event-types could be added as needed/appropriate.

I decided to do it this way because I felt it didn't make sense for me
to expose the attributes directly, since they are often back-end
specific anyway.   Therefore I leave it to the device-specific ABI which

Ah, good question.  This ties into the statement I made earlier about
how presumably the administrative agent would know what a module is and
how it works.  As part of this, they would also handle any kind of
additional work, such as wiring the backend up.  Here is a script that I
use for testing that demonstrates this:

------------------
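#!/bin/bash

set -e

# Load the venet-tap device model and mount configfs, which is used
# to manage the vbus objects below.
modprobe venet-tap
mount -t configfs configfs /config

bridge=vbus-br0

# Create the bridge that the devices will be attached to.
brctl addbr $bridge
brctl setfd $bridge 0
ifconfig $bridge up

# createtap <name>: create a venet-tap device and a bus named after <name>,
# attach the device to the bus, enable it, and add its interface to the bridge.
createtap()
{
    mkdir /config/vbus/devices/$1-dev
    echo venet-tap > /config/vbus/devices/$1-dev/type
    mkdir /config/vbus/instances/$1-bus
    ln -s /config/vbus/devices/$1-dev /config/vbus/instances/$1-bus
    echo 1 > /sys/vbus/devices/$1-dev/enabled

    ifname=$(cat /sys/vbus/devices/$1-dev/ifname)
    ifconfig $ifname up
    brctl addif $bridge $ifname
}

createtap client
createtap server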

--------------------

This script creates two buses ("client-bus" and "server-bus"),
instantiates a single venet-tap on each of them, and then "wires" them
together ...
From: Andi Kleen
Date: Wednesday, April 1, 2009 - 3:23 pm

Ok so you rely on a transaction model where everything is set up
before it is somehow committed to the guest? I hope that is made

The usual problem with that is permissions. Just making qemu-ifup suid



Not only because of blocking, but also because of security issues.
After all one of the usual reasons to run a guest is security isolation.

In general the more powerful the guest API the more risky it is, so some
self moderation is probably a good thing.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Gregory Haskins
Date: Wednesday, April 1, 2009 - 4:05 pm

Well, it's not an explicit transaction model, but I guess you could think
of it that way.

Generally you set the device up before you launch the guest.  By the
time the guest loads and tries to scan the bus for the initial
discovery, all the devices would be ready to go.

This does bring up the question of hotswap.  Today we fully support
hotswap in and out, but leaving this "enabled" transaction to the
individual device means that the device-id would be visible in the bus
namespace before the device may want to actually communicate.  Hmmm

Perhaps I need to build this in as a more explicit "enabled"
feature...and the guest will not see the driver::probe() until this happens.

Well, it's kind of out of my control.  venet-tap ultimately creates a
simple netif interface which we must do something with.  Once it's
created, "wiring" it up to something like a linux-bridge is no different
than something like a tun-tap, so the qemu-ifup requirement doesn't change.

The one thing I can think of is it would be possible to build a
"venet-switch" module, and this could be done without using brctl or
qemu-ifup...but then I would lose all the benefits of re-using that
infrastructure.  I do not recommend we actually do this, but it would

Oh yeah, totally agreed.  Not that I am advocating this, because I have
abandoned the idea.  But back when I was thinking of this, I would have
addressed the security with the vbus and syscall-proxy-device objects
themselves.  E.g. if you don't instantiate a syscall-proxy-device on the
bus, the guest wouldn't have access to syscalls at all.   And you could
put filters into the module to limit what syscalls were allowed, which

-Greg

From: Rusty Russell
Date: Tuesday, March 31, 2009 - 11:08 pm

That rtt time is awful.  I know the notification suppression heuristic
in qemu sucks.

I could dig through the code, but I'll ask directly: what heuristic do
you use for notification prevention in your venet_tap driver?

As you point out, 350-450 is possible, which is still bad, and it's at least
partially caused by the exit to userspace and two system calls.  If virtio_net

At some point, the copying will hurt you.  This is fairly easy to avoid on
xmit tho.

Cheers,
Rusty.
--

From: Gregory Haskins
Date: Wednesday, April 1, 2009 - 4:35 am

I am not 100% sure I know what you mean with "notification prevention",
but let me take a stab at it.

So like most of these kinds of constructs, I have two rings (rx + tx on
the guest is reversed to tx + rx on the host), each of which can signal
in either direction for a total of 4 events, 2 on each side of the
connection.  I utilize what I call "bidirectional napi" so that only the
first packet submitted needs to signal across the guest/host boundary.
E.g. first ingress packet injects an interrupt, and then does a
napi_schedule and masks future irqs.  Likewise, first egress packet does
a hypercall, and then does a "napi_schedule" (I don't actually use napi
in this path, but it's conceptually identical) and masks future
hypercalls.  So that is my first form of what I would call notification
prevention.

The second form occurs on the "tx-complete" path (that is guest->host
tx).  I only signal back to the guest to reclaim its skbs every 10
packets, or if I drain the queue, whichever comes first (note to self:
make this # configurable).

The nice part about this scheme is it significantly reduces the amount
of guest/host transitions, while still providing the lowest latency
response for single packets possible.  e.g. Send one packet, and you get
one hypercall, and one tx-complete interrupt as soon as it queues on the
hardware.  Send 100 packets, and you get one hypercall and 10
tx-complete interrupts as frequently as every tenth packet queues on the
hardware.  There is no timer governing the flow, etc.
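
The tx-complete rule described above (signal every 10 packets, or when the queue drains, whichever comes first) can be modelled in a few lines of Python. This is a simplified sketch, not the venet-tap code: the function name is invented, and it assumes the whole burst is already queued when the host starts draining.

```python
def tx_complete_signals(packets_in_burst, batch=10):
    """Count host->guest tx-complete interrupts for one burst drained in one pass."""
    signals = 0
    pending = 0  # packets completed but not yet reported to the guest
    for _ in range(packets_in_burst):
        pending += 1
        if pending == batch:
            # Every 'batch' packets: tell the guest to reclaim its skbs.
            signals += 1
            pending = 0
    if pending:
        # Queue drained with unreported completions: signal immediately.
        signals += 1
    return signals


single = tx_complete_signals(1)
pair = tx_complete_signals(2)
burst = tx_complete_signals(100)
```

Under these assumptions a single packet and a two-packet burst each produce one interrupt at drain time, and a 100-packet burst produces ten, matching the figures given above.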


But that is the whole point, isn't it?  I created vbus specifically as a
framework for putting things in the kernel, and that *is* one of the
major reasons it is faster than virtio-net...it's not the difference in,
say, IOQs vs virtio-ring (though note I also think some of the
innovations we have added such as bi-dir napi are helping too, but these
are not "in-kernel" specific kinds of features and could probably help
the userspace version too).

I would be entirely happy ...
From: Rusty Russell
Date: Wednesday, April 1, 2009 - 6:24 pm

Good stab, though I was referring to guest->host signals (I'll assume
you use a similar scheme there).

You use a number of packets, qemu uses a timer (150usec), lguest uses a
variable timer (starting at 500usec, dropping by 1 every time but increasing
by 10 every time we get fewer packets than last time).
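
For comparison, the variable timer described for lguest might look something like the following Python sketch. It is a literal reading of the sentence above and not the lguest source: the function name, the choice to apply either the decrease or the increase (rather than both) on a given step, and the 1usec floor are all assumptions.

```python
def next_timeout(timeout_us, packets_now, packets_last):
    """Adapt the mitigation timer after each batch (assumed semantics)."""
    if packets_now < packets_last:
        # Fewer packets than last time: back off by 10usec.
        return timeout_us + 10
    # Otherwise creep downwards by 1usec, never below an assumed floor of 1.
    return max(1, timeout_us - 1)


timeout = 500   # starting value quoted above
last = 0
history = []
for packets in (8, 8, 9, 3, 4):   # made-up batch sizes
    timeout = next_timeout(timeout, packets, last)
    history.append(timeout)
    last = packets
```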

So, if the guest sends two packets and stops, you'll hang indefinitely?
That's why we use a timer, otherwise any mitigation scheme has this issue.

Thanks,
--

From: Gregory Haskins
Date: Wednesday, April 1, 2009 - 7:27 pm

Oh, actually no.  The guest->host path only uses the "bidir napi" thing
I mentioned.  So first packet hypercalls the host immediately with no
delay, schedules my host-side "rx" thread, disables subsequent
hypercalls, and returns to the guest.  If the guest tries to send
another packet before the time it takes the host to drain all queued
skbs (in this case, 1), it will simply queue it to the ring with no
additional hypercalls.    Like typical napi ingress processing, the host
will leave hypercalls disabled until it finds the ring empty, so this
process can continue indefinitely until the host catches up.  Once fully
drained,  the host will re-enable the hypercall channel and subsequent
transmissions will repeat the original process.

In summary, infrequent transmissions will tend to have one hypercall per
packet.  Bursty transmissions will have one hypercall per burst
(starting immediately with the first packet).  In both cases, we
minimize the latency to get the first packet "out the door".
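
The guest->host behaviour summarized above can be sketched as a small state machine. This is a toy model in Python with invented names (TxChannel, guest_send, host_drain); it leaves out the host thread, real hypercalls, and the race handling a real implementation needs when re-enabling notifications.

```python
from collections import deque


class TxChannel:
    """Toy model of 'first packet kicks, later packets just queue'."""

    def __init__(self):
        self.ring = deque()
        self.hypercalls_enabled = True
        self.hypercalls = 0

    def guest_send(self, pkt):
        self.ring.append(pkt)
        if self.hypercalls_enabled:
            self.hypercalls += 1             # kick the host right away
            self.hypercalls_enabled = False  # mask further kicks until drained

    def host_drain(self):
        sent = []
        while self.ring:
            sent.append(self.ring.popleft())
        self.hypercalls_enabled = True       # ring empty: re-enable the kick
        return sent


# Infrequent traffic: the host drains between packets.
sparse = TxChannel()
for pkt in range(3):
    sparse.guest_send(pkt)
    sparse.host_drain()

# Bursty traffic: five packets arrive before the host gets to run.
bursty = TxChannel()
for pkt in range(5):
    bursty.guest_send(pkt)
drained = bursty.host_drain()
```

In this model the sparse case costs one hypercall per packet and the burst costs one hypercall in total, with the first packet never waiting on a timer.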

So really the only place I am using a funky heuristic is the modulus 10
operation for tx-complete going host->guest.  The rest are kind of

Shouldn't, no.  The host will send tx-complete interrupts at *max* every
10 packets, but if it drains the queue before the modulus 10 expires, it
will send a tx-complete immediately, right before it re-enables
hypercalls.  So there is no hang, and there is no delay.

For reference, here is the modulus 10 signaling
(./drivers/vbus/devices/venet-tap.c, line 584):

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=drivers/vbus/devices/venet-tap.c;h=0ccb7ed94a1a8edd0cca269488f940f40fce20df;hb=master#l584

Here is the one that happens after the queue is fully drained (line 593)

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=drivers/vbus/devices/venet-tap.c;h=0ccb7ed94a1a8edd0cca269488f940f40fce20df;hb=master#l593

and finally, here is where I re-enable hypercalls ...
From: Anthony Liguori
Date: Wednesday, April 1, 2009 - 9:10 am

I doubt the userspace exit is the problem.  On a modern system, it takes 
about 1us to do a light-weight exit and about 2us to do a heavy-weight 
exit.  A transition to userspace is only about ~150ns, the bulk of the 
additional heavy-weight exit cost is from vcpu_put() within KVM.

If you were to switch to another kernel thread, and I'm pretty sure you 
have to, you're going to still see about a 2us exit cost.  Even if you 
factor in the two syscalls, we're still talking about less than .5us 
that you're saving.  Avi mentioned he had some ideas to allow in-kernel 
thread switching without taking a heavy-weight exit but suffice to say, 
we can't do that today.

You have no easy way to generate PCI interrupts in the kernel either.  
You'll most certainly have to drop down to userspace anyway for that.

I believe the real issue is that we cannot get enough information today 
from tun/tap to do proper notification prevention b/c we don't know when 
the packet processing is completed.

Regards,

Anthony Liguori
--

From: Rusty Russell
Date: Saturday, April 4, 2009 - 8:44 pm

Just to inject some facts, servicing a ping via tap (ie host->guest then
guest->host response) takes 26 system calls from one qemu thread, 7 from
another (see strace below). Judging by those futex calls, multiple context

He switches to another thread, too, but with the right infrastructure (ie.
skb data destructors) we could skip this as well.  (It'd be interesting to
see how virtual-bus performed on a single cpu host).

Cheers,
Rusty.

Pid 10260:
12:37:40.245785 select(17, [4 6 8 14 16], [], [], {0, 996000}) = 1 (in [6], left {0, 992000}) <0.003995>
12:37:40.250226 read(6, "\0\0\0\0\0\0\0\0\0\0RT\0\0224V*\211\24\210`\304\10\0E\0"..., 69632) = 108 <0.000051>
12:37:40.250462 write(1, "tap read: 108 bytes\n", 20) = 20 <0.000197>
12:37:40.250800 ioctl(7, 0x4008ae61, 0x7fff8cafb3a0) = 0 <0.000223>
12:37:40.251149 read(6, 0x115c6ac, 69632) = -1 EAGAIN (Resource temporarily unavailable) <0.000019>
12:37:40.251292 write(1, "tap read: -1 bytes\n", 19) = 19 <0.000085>
12:37:40.251488 clock_gettime(CLOCK_MONOTONIC, {1554, 633304282}) = 0 <0.000020>
12:37:40.251604 clock_gettime(CLOCK_MONOTONIC, {1554, 633413793}) = 0 <0.000019>
12:37:40.251717 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 <0.001222>
12:37:40.253037 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left {1, 0}) <0.000026>
12:37:40.253196 read(16, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128 <0.000022>
12:37:40.253324 rt_sigaction(SIGALRM, NULL, {0x406d50, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7f1a842430f0}, 8) = 0 <0.000018>
12:37:40.253477 write(5, "\0", 1)       = 1 <0.000022>
12:37:40.253585 read(16, 0x7fff8cb09440, 128) = -1 EAGAIN (Resource temporarily unavailable) <0.000020>
12:37:40.253687 clock_gettime(CLOCK_MONOTONIC, {1554, 635496181}) = 0 <0.000019>
12:37:40.253798 writev(6, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"*\211\24\210`\304RT\0\0224V\10\0E\0\0T\255\262\0\0@\1G"..., 98}], 2) = 108 <0.000062>
12:37:40.253993 ioctl(7, 0x4008ae61, 0x7fff8caff460) = 0 <0.000161>
12:37:40.254263 ...
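
A rough way to reproduce the per-thread syscall counts from a trace like the one above is sketched here. It assumes the trace was captured with strace's `-tt` and `-T` options (wall-clock timestamps and `<elapsed>` suffixes, as in the lines shown); the function name `tally_syscalls` and the shortened sample lines are illustrative only.

```python
import re
from collections import Counter

# Matches lines such as:
#   12:37:40.250226 read(6, "...", 69632) = 108 <0.000051>
# Group 1 is the syscall name, group 2 the elapsed time printed by strace -T.
LINE_RE = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d+ (\w+)\(.*<(\d+\.\d+)>\s*$")

def tally_syscalls(lines):
    """Return (calls per syscall, seconds spent per syscall)."""
    counts = Counter()
    seconds = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip headers such as "Pid 10260:" and truncated lines
        name, elapsed = m.group(1), float(m.group(2))
        counts[name] += 1
        seconds[name] += elapsed
    return counts, seconds

# Shortened sample in the same format as the trace above (illustrative).
sample = [
    r'Pid 10260:',
    r'12:37:40.250226 read(6, "\0\0\0"..., 69632) = 108 <0.000051>',
    r'12:37:40.250462 write(1, "tap read: 108 bytes\n", 20) = 20 <0.000197>',
    r'12:37:40.251149 read(6, 0x115c6ac, 69632) = -1 EAGAIN (Resource temporarily unavailable) <0.000019>',
    r'12:37:40.254263 ...',
]
counts, seconds = tally_syscalls(sample)
total_calls = sum(counts.values())
print(total_calls, dict(counts))
```

Feeding each thread's section of the trace through `tally_syscalls` separately gives the per-thread totals; the `seconds` counter shows where the time goes (for example, the futex wait).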
--

From: Avi Kivity
Date: Sunday, April 5, 2009 - 1:06 am

Interesting stuff.  Even if amortized over half a ring's worth of 
packets, that's quite a lot.

Two threads are involved (we complete on the iothread, since we don't 

Should switch to epoll with its lower wait costs.  Unfortunately the 

Looks like the interrupt from the iothread was injected and delivered 
before the iothread could give up the mutex, so we needed to wait here.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Anthony Liguori
Date: Sunday, April 5, 2009 - 7:13 am

N.B. we're not optimized for latency today.  With the right 
infrastructure in userspace, I'm confident we could get this down.

What we need is:

1) Lockless MMIO/PIO dispatch (there should be two IO registration 
interfaces, a new lockless one and the legacy one)
2) A virtio-net thread that's independent of the IO thread.
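
To make item 1 concrete, here is a minimal sketch of what two registration interfaces could look like. It is a toy model, not qemu's actual API: `IODispatcher`, `register_legacy`, `register_lockless` and the addresses are all hypothetical names, and it assumes handlers are registered before any dispatch happens (the table itself is not protected).

```python
import threading

class IODispatcher:
    """Hypothetical MMIO/PIO dispatch table; not qemu's real interface."""

    def __init__(self):
        self.global_lock = threading.Lock()  # stand-in for the global mutex
        self._handlers = {}                  # addr -> (handler, needs_lock)

    def register_legacy(self, addr, handler):
        # Legacy interface: the handler runs with the global lock held.
        self._handlers[addr] = (handler, True)

    def register_lockless(self, addr, handler):
        # New interface: the handler does its own synchronisation, if any.
        self._handlers[addr] = (handler, False)

    def dispatch(self, addr, value):
        handler, needs_lock = self._handlers[addr]
        if needs_lock:
            with self.global_lock:
                return handler(value)
        return handler(value)  # fast path: no global lock taken

io = IODispatcher()
seen = {}
# Each handler records whether the global lock was held when it ran.
io.register_legacy(0x3f8, lambda v: seen.setdefault("legacy", io.global_lock.locked()))
io.register_lockless(0xc000, lambda v: seen.setdefault("lockless", io.global_lock.locked()))
io.dispatch(0x3f8, 1)
io.dispatch(0xc000, 1)
print(seen)
```

Existing devices would keep using the legacy call unchanged, while a latency-sensitive device could opt in to the lockless path.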

It would be interesting to count the number of syscalls required in the 
lguest path since that should be a lot closer to optimal.

Regards,

Anthony Liguori

--

From: Avi Kivity
Date: Sunday, April 5, 2009 - 9:10 am

Not sure exactly how much this is needed, since when there is no 
contention, locks are almost free (there's the atomic and cacheline 
bounce, but no syscall).

For any long operations, we should drop the lock (of course we need some 

Yes -- that saves us all the select() prologue (calculating new timeout) 
and the select() itself.



-- 
error compiling committee.c: too many arguments to function

--

From: Anthony Liguori
Date: Sunday, April 5, 2009 - 9:45 am

There should be no contention but I strongly suspect there is more often 
than we think.  The IO thread can potentially hold the lock for a very 
long period of time.  Take into consideration things like qcow2 metadata 

In an ideal world, we could do the submission via io_submit in the VCPU 
context and not worry about the copy latency (because we're zero copy).  
Then our packet transmission latency is consistently low because the 
path is consistent and lockless.  This is why dropping the lock is so 
important: it's not enough to usually have low latency.  We need to try 
to have latency as low as possible as often as possible.
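
The "drop the lock around long operations" pattern can be sketched roughly as follows. This is an illustration only: `big_lock`, `slow_io` and `io_thread_step` are made-up names, and the two events exist purely so the example can show deterministically that the lock is free while the slow operation runs.

```python
import threading

big_lock = threading.Lock()       # stand-in for the global lock
pending = []                      # shared state protected by big_lock
in_slow_op = threading.Event()    # set once the slow operation has started
may_finish = threading.Event()    # lets the demo end the slow operation

def slow_io(req):
    # Placeholder for a long operation (e.g. a metadata update).
    in_slow_op.set()
    may_finish.wait(timeout=5)
    return req * 2

def io_thread_step(results):
    with big_lock:
        if not pending:
            return
        req = pending.pop(0)      # take work while holding the lock
    out = slow_io(req)            # lock dropped: other threads can run
    with big_lock:
        results.append(out)       # reacquire only to publish the result

pending.append(21)
results = []
t = threading.Thread(target=io_thread_step, args=(results,))
t.start()
in_slow_op.wait(timeout=5)
# A "VCPU" thread can take the lock immediately during the slow operation.
got_lock = big_lock.acquire(blocking=False)
if got_lock:
    big_lock.release()
may_finish.set()
t.join()
print(got_lock, results)
```

Had `slow_io` been called inside the first `with big_lock:` block, the non-blocking acquire would fail and a real VCPU thread would stall for the full duration of the operation.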

Regards,


--

From: Herbert Xu
Date: Wednesday, April 1, 2009 - 8:15 pm

FWIW I don't really care whether we go with this or a kernel
virtio_net backend.  Either way should be good.  However the
status quo where we're stuck with a user-space backend really
sucks!

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
