This patch refactors the current hypercall infrastructure to better support live
migration and SMP. It eliminates the hypercall page by trapping the UD
exception that would occur if you used the wrong hypercall instruction for the
underlying architecture and replacing it with the right one lazily.
It also introduces the infrastructure to probe for available hypercalls via
CPUID leaf 0x40000002. CPUID leaf 0x40000003 should be filled out by
userspace.
A fall-out of this patch is that the unhandled hypercalls no longer trap to
userspace. There is very little reason though to use a hypercall to communicate
with userspace as PIO or MMIO can be used. There is no code in tree that uses
userspace hypercalls.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index ad08138..1cde572 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -46,6 +46,7 @@
#define KVM_MAX_CPUID_ENTRIES 40
#define DE_VECTOR 0
+#define UD_VECTOR 6
#define NM_VECTOR 7
#define DF_VECTOR 8
#define TS_VECTOR 10
@@ -317,9 +318,6 @@ struct kvm_vcpu {
unsigned long cr0;
unsigned long cr2;
unsigned long cr3;
- gpa_t para_state_gpa;
- struct page *para_state_page;
- gpa_t hypercall_gpa;
unsigned long cr4;
unsigned long cr8;
u64 pdptrs[4]; /* pae */
@@ -622,7 +620,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
int kvm_mmu_load(struct kvm_vcpu *vcpu);
void kvm_mmu_unload(struct kvm_vcpu *vcpu);
-int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run);
+int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
+
+int kvm_fix_hypercall(struct kvm_vcpu *vcpu);
static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
u32 error_code)
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 99e4917..5211d19 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -39,6 +39,7 @@
#include <linux/smp.h>
#include <linux/anon_inodes.h>
#include ...
I guess it would be pretty rude/unlikely for these opcodes to get reused
in other implementations... But couldn't you make the page trap
Is this compatible with Xen's (and other's) use of cpuid? That is,
0x40000000 returns a hypervisor-specific signature in e[bcd]x, and eax
has the max hypervisor leaf.
J
-
The whole point of using the instruction is to allow hypercalls to be used in many locations. This has the nice side effect of not requiring a central hypercall initialization routine in the guest to fetch the hypercall page. A PV driver can be completely independent of any other
Xen is currently using 0/1/2. I had thought it was only using 0/1. The intention was not to squash Xen's current CPUID usage so that it would still be possible for Xen to make use of the guest code. Can we agree that Xen won't squash leaves 3/4 or is it not worth trying to be compatible at this point?
Regards,
-
But if the instruction is architecture dependent, and you run on the wrong architecture, now you have to patch many locations at fault time, introducing some nasty runtime code / data cache overlap performance problems. Granted, they go away eventually. I prefer the idea of a hypercall page, but not a central initialization. Rather, a decentralized approach where PV drivers can detect using CPUID which hypervisor is present, and a common MSR shared by all hypervisors that provides the location of the hypercall page. Zach -
We're addressing that by blowing away the shadow cache and holding the big kvm lock to ensure SMP safety. Not a great thing to do from a performance perspective but the whole point of patching is that the cost
So then each module creates a hypercall page using this magic MSR and the hypervisor has to keep track of it so that it can appropriately change the page on migration. The page can only contain a single instruction or else it cannot be easily changed (or you have to be able to prevent the guest from being migrated while in the hypercall page).
We're really talking about identical models. Instead of an MSR, the #GP is what tells the hypervisor to update the instruction. The nice thing about this is that you don't have to keep track of all the current hypercall page locations in the hypervisor.
Regards,
-
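For readers following the thread, the lazy patching idea can be sketched in plain C. This is only an illustration, not the code from the patch: `fix_hypercall_insn` and `enum host_arch` are hypothetical names, and the sketch assumes the handler has already been handed a writable copy of the bytes at the faulting guest instruction. The only hard facts used are the two three-byte encodings, VMCALL (0f 01 c1) on Intel VT and VMMCALL (0f 01 d9) on AMD SVM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Both hypercall instructions are exactly three bytes long. */
#define HYPERCALL_INSN_LEN 3

static const uint8_t vmcall_insn[HYPERCALL_INSN_LEN]  = { 0x0f, 0x01, 0xc1 }; /* Intel VT */
static const uint8_t vmmcall_insn[HYPERCALL_INSN_LEN] = { 0x0f, 0x01, 0xd9 }; /* AMD SVM  */

/* Hypothetical tag for the architecture the guest is currently running on. */
enum host_arch { HOST_INTEL_VT, HOST_AMD_SVM };

/*
 * Hypothetical helper, called after the guest faulted at `insn`.
 * If the bytes are the *other* vendor's hypercall instruction, overwrite
 * them with the native one and return true so the guest can retry.
 * Otherwise return false; a real handler would then deliver the fault
 * to the guest unchanged.
 */
static bool fix_hypercall_insn(uint8_t *insn, size_t len, enum host_arch host)
{
    const uint8_t *native  = (host == HOST_INTEL_VT) ? vmcall_insn : vmmcall_insn;
    const uint8_t *foreign = (host == HOST_INTEL_VT) ? vmmcall_insn : vmcall_insn;

    assert(insn != NULL);

    if (len < HYPERCALL_INSN_LEN)
        return false;
    if (memcmp(insn, foreign, HYPERCALL_INSN_LEN) != 0)
        return false;

    memcpy(insn, native, HYPERCALL_INSN_LEN);
    return true;
}
```

Because both encodings have the same length, the rewrite never moves surrounding code, which is what makes patching at fault time feasible. The SMP and shadow-cache concerns raised above are outside the scope of this sketch.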
I agree, multiple hypercall pages is insane. I was thinking more of a single hypercall page, fixed in place by the hypervisor, not the kernel. Then each module can read an MSR saying what VA the hypercall page is at, and the hypervisor can simply flip one page to switch architectures. Zach -
VA as in "Virtual Address"? the ppc people don't have hypervisor-visible virtual addresses, and the hypervisor (on x86) can't safely select a virtual address, and ... That means you need a physical address, so you need a central initialization routine, and drivers for unmodified OSes can no longer be self contained. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -
That requires a memory hole though. In KVM, we don't have a memory hole. Regards, -
I see. So you take the fault, disassemble the instruction, see that it's
another CPU's vmcall instruction, and then replace it with the current
No, the point is that you're supposed to work out which hypervisor it is
from the signature in leaf 0, and then the hypervisor can put anything
it wants in the other leaves.
J
-
Yeah, see, the initial goal was to make it possible to use the KVM paravirtualizations on other hypervisors. However, I don't think this is really going to be possible in general so maybe it's better to just use leaf 0. I'll let others chime in before sending a new patch. Regards, -
Hm. Obviously you can just define a signature for "kvm-compatible
hypercall interface" and make it common that way, but it gets tricky if
the hypervisor supports multiple hypercall interfaces, including the kvm
one. Start the kvm leaves at 0x40001000 or something?
J
-
Yeah, that works with me. Regards, -
To me this is the beginning of fragmentation. Why do we need different and VMM-specific Linux paravirtualization for hardware-assisted Jun --- Intel Open Source Technology Center -
On the contrary. Xen already has a hypercall interface, and we need to
keep supporting it. If we were to also support a vmm-independent
interface (aka "kvm interface"), then we need to be able to do that in
parallel. If we have a cpuid leaf clash, then it's impossible to do so;
if we define the new interface to be disjoint from other current users
of cpuid, then we can support them concurrently.
J
-
Today, 3 CPUID leaves starting from 0x4000_0000 are defined in a generic fashion (hypervisor detection, version, and hypercall page), and those are the ones used by Xen today. We should extend those leaves (e.g. starting from 0x4000_0003) for the vmm-independent features as well. If Xen needs additional Xen-specific features, we need to allocate some leaves for those (e.g. 0x4000_1000) Jun --- Intel Open Source Technology Center -
But the signature is "XenVMMXenVMM", which isn't very generic. If we're
presenting a generic interface, it needs to have a generic signature,
otherwise guests will need to have a list of all hypervisor signatures
supporting their interface. Since 0x40000000 has already been
established as the base leaf of the hypervisor-specific interfaces, the
generic interface will have to be elsewhere.
J
-
The hypervisor detection mechanism is generic, and the signature returned is implementation specific. Having a list of all hypervisor signatures sounds fine to me as we are detecting vendor-specific
Jun
---
Intel Open Source Technology Center
-
I'm confused about what you're proposing. I was thinking that a kernel
looking for the generic hypervisor interface would check for a specific
signature at some cpuid leaf, and then go about using it from there. If
not, how is it supposed to detect the generic hypervisor interface?
J
-
I'm suggesting that we use CPUID.0x4000000Y (Y: TBD, e.g. 6) for Linux paravirtualization. The ebx, ecx and edx return the Linux paravirtualization features available on that hypervisor. Those features are defined architecturally (not VMM specific).
Like CPUID.0, CPUID.0x40000000 is used to detect the hypervisor with the vendor identification string returned in ebx, edx, and ecx (as we are doing in Xen). The eax returns the max leaf (which is 0x40000002 on Xen today). And like CPUID.1, CPUID.0x40000001 returns the version number in eax, and each VMM should be able to define a number of VMM-specific features available in ebx, ecx, and edx returned (which are reserved, i.e. not used in Xen today).
Suppose we knew (i.e. tested) Xen and KVM supported Linux paravirtualization, the Linux code does:
1. detect Xen or KVM <the list> using CPUID.0x40000000
2. Check the version if necessary using CPUID.0x40000001
3. Check the Linux paravirtualization features available using CPUID.0x4000000Y.
Jun
---
Intel Open Source Technology Center
-
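Step 1 of the list above (matching the leaf 0x40000000 signature against a list of known hypervisors) could look roughly like the following. This is a sketch under stated assumptions: `struct cpuid_regs`, `cpuid_fn`, `hv_detect` and `fake_xen_cpuid` are hypothetical names, CPUID is abstracted behind a function pointer so the logic can run without the instruction itself, and the 12 signature bytes are assumed to be laid out in ebx, ecx, edx order. The fake only reports what the thread says about Xen: the signature "XenVMMXenVMM" and a max leaf of 0x40000002.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HV_CPUID_BASE 0x40000000u

/* Hypothetical container for the four CPUID output registers. */
struct cpuid_regs { uint32_t eax, ebx, ecx, edx; };

/* Hypothetical callback type; a real guest would execute CPUID here. */
typedef struct cpuid_regs (*cpuid_fn)(uint32_t leaf);

/*
 * Read leaf 0x40000000 and compare its 12-byte signature against the
 * caller's list. Returns the index of the matching entry, or -1 if none
 * matches. On a match, *max_leaf (if non-NULL) receives eax.
 * Assumption: the signature bytes are in ebx, ecx, edx order.
 */
static int hv_detect(cpuid_fn cpuid, const char *const known[], size_t n,
                     uint32_t *max_leaf)
{
    char sig[13];
    struct cpuid_regs r;

    assert(cpuid != NULL);
    r = cpuid(HV_CPUID_BASE);

    memcpy(sig + 0, &r.ebx, 4);
    memcpy(sig + 4, &r.ecx, 4);
    memcpy(sig + 8, &r.edx, 4);
    sig[12] = '\0';

    for (size_t i = 0; i < n; i++) {
        if (strncmp(sig, known[i], 12) == 0) {
            if (max_leaf != NULL)
                *max_leaf = r.eax;
            return (int)i;
        }
    }
    return -1;
}

/* Illustration-only stand-in that answers leaf 0x40000000 like Xen. */
static struct cpuid_regs fake_xen_cpuid(uint32_t leaf)
{
    struct cpuid_regs r = { 0, 0, 0, 0 };

    if (leaf == HV_CPUID_BASE) {
        r.eax = 0x40000002u;          /* max hypervisor leaf */
        memcpy(&r.ebx, "XenV", 4);
        memcpy(&r.ecx, "MMXe", 4);
        memcpy(&r.edx, "nVMM", 4);
    }
    return r;
}
```

A guest would pass its own list of signatures it has been tested against, which is the "list of all hypervisor signatures" debated in this thread.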
I don't understand the purpose of returning the max leaf. Who is that information useful for?
I like Jeremy's suggestion of starting with 0x40001000 for KVM. Xen has an established hypercall interface and that isn't going to change. However, in the future, if other Operating Systems (like the BSDs) choose to implement the KVM paravirtualization interface, then that leaves open the possibility for Xen to also support this interface to get good performance for those OSes. It's necessary to be able to support both at once if you wish to support these interfaces without user interaction.
There's no tangible benefit to us to use 0x40000000. Therefore I'm inclined to lean toward making things easier for others.
Regards,
-
Well, this is the key info to the user of CPUID. It tells which leaves are valid to use. Otherwise, the user cannot tell whether the results of CPUID.0x4000000N are valid or not (i.e. junk). BTW, this is what we are doing on the native (for the leaf 0, 0x80000000, for example). The fact
Using CPUID.0x4000000N (N > 2) does not prevent Xen from doing that, either. If you use 0x40001000, 1) you need to say the leaves from 0x40000000 through 0x40001000 are all valid, OR 2) you create/fork a
Again, 0x40000000 is not Xen specific. If the leaf 0x40000000 is used for any guest to detect any hypervisor, that would be a compelling benefit. For future Xen-specific features, it's safe for Xen to use other bigger leaves (like 0x40001000) because the guest starts looking at them after detection of Xen. Likewise if KVM paravirtualization interface (as kind of "open source paravirtualization interface") is detected in the generic areas (not in vendor-specific), any guest can check the features available without
Jun
---
Intel Open Source Technology Center
-
Then it's just a version ID. You pretty much have to treat it as a version id because if it returns 0x4000 0003 and you only know what 0002 is, then you can't actually use it.
I much prefer the current use of CPUID in KVM. If 1000 returns the KVM signature, then 1001 *must* be valid and contain a set of feature bits. If we wish to use additional CPUID leaves in the future, then we can just use a feature bit. The real benefit to us is that we can use a discontiguous set of leaves whereas the Xen approach is forced to use a
Why do 0000-1000 have to be valid? Xen is not going to change what they have today--they can't. However, if down the road, they decided that since so many guests use KVM's paravirtualization interface other than
I'm starting to lean toward just using 0000. If for no other reason than the hypercall space is unsharable.
Regards,
-
Yeah. It's the way all the other cpuid leaf/level stuff works, so it's
reasonable to do the same thing here. The question it helps answer is
Well, it's also what the CPU itself does. The feature bits tend to
relate to specific CPU features rather than CPUID instruction leaves.
The features themselves may also have corresponding leaves, but that's
secondary. IOW, if feature bit X is set, it may use leaf 0x4000101f,
Well, it could be, but it would take affirmative action on the guest's
part. If there's feature bits for each supported hypercall interface,
then you could have a magic MSR to select which interface you want to
use now. That would allow a generic-interface-using guest to probe for
the generic interface at cpuid leaf 0x40001000, use 40001001 to
determine whether the hypercall interface is available, 4000100x to find
the base of the magic msrs, and write appropriate msr to set the desired
hypercall style (and all this can be done without using vmcall, so it
doesn't matter that hypercall interface is initially established).
J
-
The main thing keeping me from doing this ATM is what I perceive as lack of interest in a generic interface. I think it's also a little premature given that we don't have any features on the plate yet. However, I don't think that means that we cannot turn KVM's PV into a generic one.
So here's what I propose. Let's start building the KVM PV interface on 4000 0000. That means that Xen cannot initially use it but that's okay. Once KVM-lite is merged and we have some solid features (and other guests start implementing them), we can also advertise this interface as a "generic interface" by also supporting the signature on leaf 4000 1000 and using the MSR trickery that you propose.
As long as we all agree not to use 4000 1000 for now, it leaves open the possibility of having a generic interface in the future.
Regards,
-
I don't see a particular problem with that. If the whole 0x4xxxxxxx
range is reserved for hypervisor use, and existing hypervisors are
already using 0x400000xx in hypervisor-specific ways, then it makes
sense to start the generic stuff at 0x40001xxx (or some other offset).
But without a few more implementations of the "generic" interface its
This just seems a bit grotty. You're relying on the fact that you can
overlay Xen's current use of 0x4000000x for the generic interface by
freezing Xen's current use of 40000000-2. 0x40000000 becomes a more or
less useless hypervisor-identification signature (useless because you
need to assume that leaves 4000000x, x>2 implement the generic interface
anyway, where x=1,2 are reserved for Xen (=hypervisor-specific) uses).
In other words, what mechanism can a guest use to explicitly identify
the existence of the generic interface? There needs to be a signature
for that somewhere.
J
-
No, really. Xen just _implemented_ the generic interface from the beginning, at least for 0 and 1 (version). The 0x40000002 (hypercall page) looks specific to Xen, but it can be used for KVM as well, thus can be generic (or a hypervisor can tell it's not supported by returning 0 pages for hypercall pages). If Xen implements the new generic feature (defined by 0x40000003, for example), then it returns 40000003 or large
So you don't need a signature for that. As I wrote before:
1. detect Xen or KVM <the list> using CPUID.0x40000000
2. Check the version if necessary using CPUID.0x40000001
3. Check the generic features available using CPUID.0x4000000Y, if the max leaf returned >= 0x4000000Y.
A guest wants to know who the hypervisor is for practical purposes (e.g. debugging) anyway. This is equivalent to what a native OS would do to detect a generic CPU feature.
Jun
---
Intel Open Source Technology Center
-
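The max-leaf condition in step 3 amounts to a range check before any higher leaf is trusted. A minimal sketch follows, with hypothetical names throughout; `HV_CPUID_GENERIC_FEATURES` uses 0x40000003 purely as a stand-in for the undecided 0x4000000Y, following the "for example" in the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HV_CPUID_BASE             0x40000000u
/* Assumption: stand-in for the still-undecided leaf 0x4000000Y. */
#define HV_CPUID_GENERIC_FEATURES 0x40000003u

/* The stand-in must sit above leaves 0x40000000-0x40000002 used by Xen. */
static_assert(HV_CPUID_GENERIC_FEATURES > 0x40000002u,
              "generic feature leaf must not overlap leaves already in use");

/*
 * A hypervisor leaf is only meaningful if it lies between the base leaf
 * and the max leaf reported in eax by CPUID.0x40000000. Anything outside
 * that range may return junk and must not be interpreted.
 */
static bool hv_leaf_valid(uint32_t max_leaf, uint32_t leaf)
{
    return leaf >= HV_CPUID_BASE && leaf <= max_leaf;
}
```

With the max leaf of 0x40000002 that the thread attributes to Xen today, `hv_leaf_valid(0x40000002u, HV_CPUID_GENERIC_FEATURES)` is false, so such a guest would skip the generic feature leaf rather than read junk from it.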
The only way to have a single interface is if a central authority defines and documents that interface, and all hypervisor implementors agree not to implement extensions. Do you see that happening? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -
It also has the benefit of not requiring an initialization protocol, and I definitely want kvm to be able to emulate the Xen hypercall interface, but there's no need to allow both concurrently. So I'd say use 0x40000000 for detection and the rest cannot clash because detection fails. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -
That's a pain for inline hypercalls tho. I was planning on moving lguest to this model (which is interesting, because AFAICT this insn will cause a #UD or #GP depending on whether VT is supported on this box so I have to look for both). Cheers, Rusty. -
