The ease with which a new hypervisor should be able to integrate into the
stack is only one of vbus's many benefits.
No, that is incorrect. Not to be rude, but for clarity:
Complementary \Com`ple*men"ta*ry\, a.
Serving to fill out or to complete; as, complementary
numbers.
[1913 Webster]
Citation: www.dict.org
IOW: Something being complementary has nothing to do with guest/host
binary compatibility. virtio-pci and virtio-vbus are both equally
complementary to virtio since they fill in the bottom layer of the
virtio stack.
So yes, vbus is truly complementary to virtio afaict.
Binary compatibility with existing virtio drivers, while nice to have,
is neither a specific requirement nor a goal. We will simply load an updated
KMP/MSI into those guests and they will work again. As previously
discussed, this is how more or less any system works today. It's like
we are removing an old adapter card and adding a new one to "uprev the
silicon".
Actually I misspoke earlier when I said virtio works over non-shmem.
Thinking about it some more, both virtio and vbus fundamentally require
shared-memory, since sharing their metadata concurrently on both sides
is their raison d'être.
The difference is that virtio utilizes a pre-translation/mapping (via
->add_buf) from the guest side. OTOH, vbus uses a post translation
scheme (via memctx) from the host-side. If anything, vbus is actually
more flexible because it doesn't assume the entire guest address space
is directly mappable.
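To illustrate the post-translation idea, here is a toy sketch in Python. It is not the real virtio or vbus code: DirectMemCtx, RemoteMemCtx, host_consume and the copy_from method are hypothetical names chosen for the illustration. The point is only that when the host resolves guest references at access time through a per-guest context, the same host-side code works whether or not the guest's memory is directly mappable.

```python
# Toy model only: all names here are hypothetical and do not mirror
# the real virtio or vbus implementations.

class DirectMemCtx:
    """Backend for a guest whose memory the host can map directly."""
    def __init__(self, ram):
        self.ram = ram  # guest address -> bytes

    def copy_from(self, gaddr, length):
        return self.ram[gaddr][:length]


class RemoteMemCtx:
    """Backend for a guest whose memory is NOT directly mappable.

    Every access goes through a fetch callback (standing in for DMA,
    a copy engine, or some other transport).
    """
    def __init__(self, fetch):
        self.fetch = fetch

    def copy_from(self, gaddr, length):
        return self.fetch(gaddr, length)


def host_consume(ring, memctx):
    """Host side: translate each descriptor at access time via memctx."""
    return [memctx.copy_from(gaddr, length) for gaddr, length in ring]


guest_ram = {0x1000: b"payload-A", 0x2000: b"payload-B"}
ring = [(0x1000, 9), (0x2000, 7)]  # descriptors hold untranslated guest refs

direct = host_consume(ring, DirectMemCtx(guest_ram))
remote = host_consume(ring, RemoteMemCtx(lambda a, n: guest_ram[a][:n]))
print(direct == remote, direct)
```

In a pre-translation scheme the ring would instead carry addresses the host dereferences directly, which only works if something like DirectMemCtx is always possible.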
In summary, your statement is incorrect (though it is my fault for
putting that idea in your head).
Well, to be fair no one said it has to ignore them. Either virtio-vbus
transport is present and available to the virtio stack, or it isn't. If
it's present, it may or may not publish objects for consumption.
Providing a virtio-vbus transport in no way limits or degrades the
existing capabilities of the virtio stack. It only enhances them.
I digress. The whole point is moot since I realized that the non-shmem
distinction isn't accurate anyway. They both require shared-memory for
the metadata, and IIUC virtio requires the entire address space to be
mappable whereas vbus only assumes the metadata is.
I actually do not have a rig set up to explicitly test inter-interrupt
rates at the moment. Once things stabilize for me, I will try to
re-gather some numbers here. Last time I looked, however, there were
some decent savings for inter as well.
Inter rates are interesting because they are what tends to ramp up with
IO load more than intra since guest interrupt mitigation techniques like
NAPI often quell intra-rates naturally. This is especially true for
data-center, cloud, hpc-grid, etc, kind of workloads (vs vanilla
desktops, etc) that tend to have multiple IO ports (multi-homed nics,
disk-io, etc). Those various ports tend to be workload-related to one
another (e.g. 3-tier web stack may use multi-homed network and disk-io
at the same time, triggered by one IO event).
An interesting thing here is that you don't even need a fancy
multi-homed setup to see the effects of my exit-ratio reduction work:
even single port configurations suffer from the phenomenon since many
devices have multiple signal-flows (e.g. network adapters tend to have
at least 3 flows: rx-ready, tx-complete, and control-events such as
link-state). What's worse is that the flows often are indirectly related
(for instance, many host adapters will free tx skbs during rx operations,
so you tend to get bursts of tx-completes at the same time as rx-ready).
If the flows map 1:1 with IDT, they will suffer the same problem.
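A toy model of the effect, in Python. This is a sketch under simplifying assumptions, not vbus's actual signaling code: the Channel class and count_interrupts helper are hypothetical, and each "burst" is assumed to arrive entirely before the guest services any interrupt. A channel raises an interrupt only if none is already pending, so signals that share a channel within a burst are coalesced.

```python
from collections import deque


class Channel:
    """Hypothetical interrupt channel: one pending flag, one event queue."""
    def __init__(self):
        self.queue = deque()
        self.irq_pending = False
        self.interrupts = 0

    def signal(self, flow):
        self.queue.append(flow)
        if not self.irq_pending:      # only inject if nothing is pending
            self.irq_pending = True
            self.interrupts += 1

    def service(self):
        """Guest handler: drain everything queued, then re-arm."""
        self.queue.clear()
        self.irq_pending = False


def count_interrupts(bursts, channel_of):
    """Deliver each burst, then let the guest service every channel."""
    channels = {}
    for burst in bursts:
        for flow in burst:
            channels.setdefault(channel_of(flow), Channel()).signal(flow)
        for ch in channels.values():
            ch.service()
    return sum(ch.interrupts for ch in channels.values())


# tx-completes tend to show up together with rx-ready
bursts = [["rx-ready", "tx-complete", "tx-complete"], ["link-state"]]

one_to_one = count_interrupts(bursts, channel_of=lambda flow: flow)
aggregated = count_interrupts(bursts, channel_of=lambda flow: "bus")
print(one_to_one, aggregated)
```

With one channel per flow the related rx/tx burst costs two interrupts; with a single shared channel it costs one.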
In any case, here is an example run of a simple single-homed guest over
standard GigE. What's interesting here is the .qnotify to .notify
ratio, as this is the interrupt-to-signal ratio. In this case, it's
151918/170047, which comes out to about 11% savings in interrupt injections:
vbus-guest:/home/ghaskins # netperf -H dev
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to dev.laurelwood.net (192.168.1.10) port 0 AF_INET
Recv     Send    Send
Socket   Socket  Message  Elapsed
Size     Size    Size     Time     Throughput
bytes    bytes   bytes    secs.    10^6bits/sec
1048576  16384   16384    10.01    940.77
vbus-guest:/home/ghaskins # cat /sys/kernel/debug/pci-to-vbus-bridge
.events : 170048
.qnotify : 151918
.qinject : 0
.notify : 170047
.inject : 18238
.bridgecalls : 18
.buscalls : 12
vbus-guest:/home/ghaskins # cat /proc/interrupts
CPU0
0: 87 IO-APIC-edge timer
1: 6 IO-APIC-edge i8042
4: 733 IO-APIC-edge serial
6: 2 IO-APIC-edge floppy
7: 0 IO-APIC-edge parport0
8: 0 IO-APIC-edge rtc0
9: 0 IO-APIC-fasteoi acpi
10: 0 IO-APIC-fasteoi virtio1
12: 90 IO-APIC-edge i8042
14: 3041 IO-APIC-edge ata_piix
15: 1008 IO-APIC-edge ata_piix
24: 151933 PCI-MSI-edge vbus
25: 0 PCI-MSI-edge virtio0-config
26: 190 PCI-MSI-edge virtio0-input
27: 28 PCI-MSI-edge virtio0-output
NMI: 0 Non-maskable interrupts
LOC: 9854 Local timer interrupts
SPU: 0 Spurious interrupts
CNT: 0 Performance counter interrupts
PND: 0 Performance pending work
RES: 0 Rescheduling interrupts
CAL: 0 Function call interrupts
TLB: 0 TLB shootdowns
TRM: 0 Thermal event interrupts
THR: 0 Threshold APIC interrupts
MCE: 0 Machine check exceptions
MCP: 1 Machine check polls
ERR: 0
MIS: 0
It's important to note here that we are actually looking at the interrupt
rate, not the exit rate (which is usually a multiple of the interrupt
rate, since you have to factor in as many as three exits per interrupt:
IPI, window, EOI). Therefore we saved about 18k interrupts in this 10
second burst, but we may have actually saved up to 54k exits in the
process. This is only over a 10 second window at GigE rates, so YMMV.
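The arithmetic behind those figures, as a short Python sketch. The counters are copied from the debugfs output above; the three-exits-per-interrupt factor is the worst case described in the text, so the exit figure is an upper bound rather than a measurement.

```python
# Counters copied from the pci-to-vbus-bridge debugfs output above.
notify = 170047    # signals raised (.notify)
qnotify = 151918   # interrupts actually injected (.qnotify)

# Worst case named in the text: up to three exits per interrupt
# (IPI, window, EOI). This is an upper bound, not a measured value.
MAX_EXITS_PER_INTERRUPT = 3

saved_interrupts = notify - qnotify
savings_pct = 100.0 * saved_interrupts / notify
max_saved_exits = saved_interrupts * MAX_EXITS_PER_INTERRUPT

print(f"interrupt-to-signal ratio: {qnotify / notify:.3f}")
print(f"interrupts saved: {saved_interrupts} ({savings_pct:.1f}%)")
print(f"exits saved (upper bound): {max_saved_exits}")
```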
These numbers get even more dramatic on higher end hardware, but I
haven't had a chance to generate new numbers yet.
Looking at some external stats paints an even bleaker picture: "exits"
as reported by kvm_stat for virtio-pci based virtio-net tip the scales
at 65k/s vs 36k/s for vbus based venet. And virtio is consuming ~30% of
my quad-core's cpu, vs 19% for venet during the test. It's hard to know
which innovation or innovations may be responsible for the entire
reduction, but certainly the interrupt-to-signal ratio mentioned above
is probably helping.
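For scale, the relative reductions implied by those rounded figures can be worked out as below. This is a sketch only: the inputs are the approximate numbers quoted above, and reduction_pct is a helper introduced here for illustration.

```python
def reduction_pct(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Approximate figures quoted in the text (kvm_stat exits/s, host cpu %).
exit_reduction = reduction_pct(65_000, 36_000)   # virtio-pci vs vbus venet
cpu_reduction = reduction_pct(30, 19)

print(f"exits/s reduced by about {exit_reduction:.1f}%")
print(f"cpu usage reduced by about {cpu_reduction:.1f}%")
```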
The even worse news for 1:1 models is that the ratio of
exits-per-interrupt climbs with load (exactly when it hurts the most)
since that is when the probability that the vcpu will need all three
exits is the highest.
Everyone is of course entitled to an opinion, but the industry as a
whole would disagree with you. Signal path routing (1:1, aggregated,
etc) is at the discretion of the bus designer. Most buses actually do
_not_ support 1:1 with IDT (think USB, SCSI, IDE, etc).
PCI is somewhat of an outlier in that regard afaict. It's actually a
nice feature of PCI when it's used within its design spec (HW). For
SW/PV, 1:1 suffers from, among other issues, that "triple-exit scaling"
issue in the signal path I mentioned above. This is one of the many
reasons I think PCI is not the best choice for PV.
That's pure speculation. I would advise you to reserve such statements
until after a proper bakeoff can be completed. This is not to mention
that vhost-net does nothing to address our other goals, like scheduler
coordination and non-802.x fabrics.
Where doesn't it fit?
Citation please. Afaict, the one use case that we looked at for vhost
outside of KVM failed to adapt properly, so I do not see how this is true.
Compatibility with what? vhost hasn't even been officially deployed in
KVM environments afaict, never mind non-virt. Therefore, how could it
possibly have compatibility constraints with something non-virt already?
Citation please.
vbus allows you to have 1:1 if that is what you want, but we strive to
do better.
It isn't, and I've already done that.
[1]
I specifically mentioned that already ([1]).
You are also overstating its role, since the basic OS is what implements
the native support for bus-objects, hotswap, etc, _not_ PCI. PCI just
rides underneath and feeds trivial events up, as do other bus-types
(usb, scsi, vbus, etc). And once those events are fed, you still need a
PV layer to actually handle the bus interface in a high-performance
manner so it's not like you really have a "native" stack in either case.
No, that is incorrect. You have to heavily modify the pci model with
layers on top to get any kind of performance out of it. Otherwise, we
would just use realtek emulation, which is technically the native PCI
you are apparently so enamored with.
Not to mention there are things you just plain can't do in PCI today,
like dynamically assign signal-paths, priority, and coalescing, etc.
Actually IIUC, I think Xen bridges to their own bus as well (and only
where they have to), just like vbus. They don't use PCI natively. PCI
is perfectly suited as a bridge transport for PV, as I think the Xen and
vbus examples have demonstrated. It's the 1:1 device-model where PCI has
the most problems.
There was a point in time where the same could be said for virtio-pci
based drivers vs realtek and e1000, so that argument is demonstrably
silly. No one tried to make virtio work in a binary compatible way with
realtek emulation, yet we all survived the requirement for loading a
virtio driver to my knowledge.
The bottom line is: Binary device compatibility is not required in any
other system (as long as you follow sensible versioning/id rules), so
why is KVM considered special?
The fact is, it isn't special (at least not in this regard). What _is_
required is "support" and we fully intend to support these proposed
components. I assure you that at least the users that care about
maximum performance will not generally mind loading a driver. Most of
them would have to anyway if they want to get beyond realtek emulation.
I am certainly in no position to tell you how to feel, but this
declaration would seem from my perspective to be more of a means to an
end than a legitimate concern. Otherwise we would never have had virtio
support in the first place, since it was not "compatible" with previous
releases.
Making systems perform 5x faster _is_ fun, yes. I love what I do for a
living.
If and when that becomes a priority concern, that would be a function
transparently supported in the BIOS shipped with the hypervisor, and
would thus be invisible to the user.
No, you are incorrect on two counts.
1) Of course I care about pain to users or I wouldn't be funded. Right
now the pain from my perspective is caused to users in the
high-performance community who want to deploy KVM based solutions. They
are unable to do so due to its performance disparity compared to
bare-metal, outside of pass-through hardware which is not widely
available in a lot of existing deployments. I aim to fix that disparity
while reusing the existing hardware investment by writing smarter
software, and I assure you that these users won't mind loading a driver
in the guest to take advantage of it.
For the users that don't care about maximum performance, there is no
change (and thus zero pain) required. They can use realtek or virtio if
they really want to. Neither is going away to my knowledge, and let's
face it: 2.6Gb/s out of virtio to userspace isn't *that* bad. But "good
enough" isn't good enough, and I won't rest till we get to native
performance. Additionally, I want to support previously unavailable
modes of operations (e.g. real-time) and advanced fabrics (e.g. IB).
2) True pain to users is not caused by lack of binary compatibility.
It's caused by lack of support. And it's a good thing or we would all be
emulating 8086 architecture forever...
..oh wait, I guess we kind of do that already ;). But at least we can
slip in something more advanced once in a while (APIC vs PIC, USB vs
uart, iso9660 vs floppy, for instance) and update the guest stack
instead of insisting it must look like ISA forever for compatibility's sake.
The user will not care where the model lives, per se. Only that it is
supported, and it works well.
Likewise, I know from experience that the developer will not like
writing the same code twice, so the "runs in both" model is not
necessarily a great design trait either.
First of all, bus-decode is substantially easier than per-device decode
(you have to track all those per-device/per-signal fds somewhere,
integrate with hotswap, etc), and it's only done once per guest at
startup and left alone. So it's already not apples to apples.
Second, while it's true that the general kvm-connector bus-decode needs
to be programmed, that is a function of adapting to the environment
that _you_ created for me. The original kvm-connector was discovered
via cpuid and hypercalls, and didn't need userspace at all to set it up.
Therefore it would be entirely unfair of you to turn around and somehow
try to use that trait of the design against me since you yourself
imposed it.
As an additional data point, our other connectors have no such
bus-decode programming requirement. Therefore, this is clearly
just a property of the KVM environment, not a function of the overall
vbus design.
Right. And among other shortcomings it also requires a KVM-esque memory
model (which is not always going to work as we recently discussed), and
a redundant device-model to back it up in userspace, which is a
development and maintenance burden, and an external bus-model (filled by
pio-bus in KVM today).
It will with the ioctl based control interface that I'll merge shortly.
This question doesn't make sense. Hotswap control occurs on the host,
which is always Linux.
If you were asking about whether a windows guest will support hotswap:
the answer is "yes". Our windows driver presents a unique PDO/FDO pair
for each logical device instance that is pushed out (just like the built
in usb, pci, scsi bus drivers that windows supports natively).
Citation?
This is more hyperbole. I doubt that there would be many that would
argue that a modular architecture (that we get for free with LKM
support) is not desirable, even if it's never used dynamically with a
running guest. OTOH, I actually use this dynamic feature all the time
as I test my components, so it's at least useful to me.
I just did.
No, that is incorrect. What you are apparently not understanding is
that not only is vbus that library, but it's extensible. So even if
compatibility is your goal (it doesn't need to be IMO) it can be
accommodated by how you interface to the library.
My primary objective is creating an extensible, high-performance,
shared-memory interconnect for systems that utilize a Linux host as
their IO-hub. It just so happens that virtio can sit nicely on top of
such a model because shmem-rings are a subclass of shmem. As a result
of its design, vbus also helps to reduce code duplication in the stack
for new environments due to its extensible nature.
However, vbus also has goals beyond what virtio is providing today that
are of more concern, and part of that is designing a connector/bus that
eliminates the shortcomings in the current pci-based design.
Already covered above.
Fair enough.
Even if that were true, which is debatable, do not confuse "convenient"
with "optimal". If you don't care about maximum performance and
advanced features like QOS, sure go ahead and use PCI. Why not.
No, that is incorrect. For one, vhost uses them on a per-signal path
basis, whereas vbus only has one channel for the entire guest->host.
Second, I do not use ioeventfd anymore because it has too many problems
with the surrounding technology. However, that is a topic for a
different thread.
No, that is incorrect. The amount of "work" that a guest does is
actually the same in both cases, since the guest OS performs the hotswap
handling natively for all bus types (at least for Linux and Windows).
You still need to have a PV layer to interface with those objects in
both cases, as well, so there is no such thing as "native interface" for
PV. It's only a matter of where it occurs in the stack.
Yes, see /sys/vbus/devices/$dev/ to get per-instance attributes.
The short answer is "not yet (I think)". I need to write a patch to
properly set the mode attribute in sysfs, but I think this will be trivial.
So what? If anything, it goes to show how extensible the framework is
that a new plane could be added in 119 lines of code:
~/git/linux-2.6> stg show vbus-add-admin-ioctls.patch | diffstat
Makefile | 3 -
config-ioctl.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 119 insertions(+), 1 deletion(-)
If and when having two control planes exceeds its utility, I will submit
a simple patch that removes the useless one.
And likewise, neither does vbus.
Kind Regards,
-Greg