Hi,
Please find below the proposal for the generic use of cpuid space
allotted for hypervisors. Apart from this cpuid space, another thing
worth noting is that Intel & AMD reserve the MSRs from 0x40000000
- 0x400000FF for software use. Though the proposal doesn't talk about
MSRs right now, we should be aware of these reservations as we may want
to extend the way we use CPUID to MSR usage as well.
While we are at it, we also think we should form a group which has at
least one person representing each of the hypervisors interested in
generalizing the hypervisor CPUID space for Linux guest OS. This group
will be informed whenever a new CPUID leaf from the generic space is to
be used. This would help avoid any duplicate definitions for a CPUID
semantic by two different hypervisors. I think most of the people are
subscribed to LKML or the virtualization lists and we should use these
lists as a platform to decide on things.
Thanks,
Alok
---
Hypervisor CPUID Interface Proposal
-----------------------------------
Intel & AMD have reserved cpuid levels 0x40000000 - 0x400000FF for
software use. Hypervisors can use these levels to provide an interface
to pass information from the hypervisor to the guest running inside a
virtual machine.
This proposal defines a standard framework for the way in which the
Linux and hypervisor communities incrementally define this CPUID space.
(This proposal may be adopted by other guest OSes. However, that is not
a requirement because a hypervisor can expose a different CPUID
interface depending on the guest OS type that is specified by the VM
configuration.)
Hypervisor Present Bit:
Bit 31 of ECX of CPUID leaf 0x1.
This bit has been reserved by Intel & AMD for use by
hypervisors, and indicates the presence of a hypervisor.
Virtual CPUs (hypervisors) set this bit to 1 and physical CPUs
(all existing and future CPUs) set this bit to zero. This bit
can be probed by the guest ...
Excuse me, but that is blatantly idiotic. Expecting the user to configure a VM to match the target OS is *exactly* as stupid as expecting the user to reconfigure the BIOS. It's totally the wrong thing to do. -hpa --
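For reference, the hypervisor-present bit described in the proposal (bit 31 of ECX in CPUID leaf 0x1) can be tested with a single mask. The following is a minimal sketch, not code from the proposal; the macro and function names are made up for illustration, and the ECX value is assumed to have been read elsewhere.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 31 of ECX in CPUID leaf 0x1: the hypervisor-present bit. */
#define CPUID_1_ECX_HYPERVISOR (UINT32_C(1) << 31)

/*
 * Hypothetical helper: takes the ECX value returned by CPUID leaf 0x1
 * and reports whether the hypervisor-present bit is set.
 */
static bool hypervisor_present(uint32_t leaf1_ecx)
{
	return (leaf1_ecx & CPUID_1_ECX_HYPERVISOR) != 0;
}
```

In user space on x86 with GCC or Clang, the ECX value could be obtained with `__get_cpuid(1, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`.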
Hi Peter, It's not a user who has to do anything special here. There are *intelligent* VM developers out there who can export a different CPUID interface depending on the guest OS type. And this is what most of the hypervisors do (not necessarily for CPUID, but for other things right now). --
It doesn't matter, really; it's still the wrong thing to do, for the same reason it's the wrong thing in -- for example -- ACPI, which has similar "cleverness". If we want to have a "Linux standard CPUID interface" suite we should just put them on a different set of numbers and let a hypervisor export all the interfaces. -hpa --
No, that's always a terrible idea. Sure, it's necessary to deal with
some backward-compatibility issues, but we shouldn't even consider a new
interface which assumes this kind of thing. We want properly enumerable
interfaces.
J
--
The reason we still have to do this is because Microsoft has already defined a CPUID format which is way different than what you or I are proposing (with the current case of 256 leaves being available). And I doubt they would change the way they deal with it on their OS. Any proposal that we go with, we will have to export a different CPUID interface from the hypervisor for the two OSes in question. So I think this is something that we anyways will have to do and not worth binging about in the discussion. --
No, that's a good hint that what "you and I" are proposing is utterly broken and exactly underscores what I have been stressing about noncompliant hypervisors. All I have seen out of Microsoft only covers CPUID levels 0x40000000 as a vendor identification leaf and 0x40000001 as a "hypervisor identification leaf", but you might have access to other information. This further underscores my belief that using 0x400000xx for anything "standards-based" at all is utterly futile, and that this space should be treated as vendor identification and the rest as vendor-specific. Any hope of creating a standard that's actually usable needs to be outside this space, e.g. in the 0x40SSSSxx space I proposed earlier. -hpa --
On 10/1/2008 3:46:45 PM, H. Peter Anvin wrote:
> All I have seen out of Microsoft only covers CPUID levels 0x40000000
> as a vendor identification leaf and 0x40000001 as a "hypervisor
> identification leaf", but you might have access to other information.

No, it says "Leaf 0x40000001 as hypervisor vendor-neutral interface identification, which determines the semantics of leaves from 0x40000002 through 0x400000FF." The Leaf 0x40000000 returns vendor identifier signature (i.e. h ...
In other words, 0x40000002+ is vendor-specific space, based on the hypervisor specified in 0x40000001 (in theory); in practice both 0x40000000:0x40000001 since M$ seem to use clever identifiers as ...
What I'm saying is that Microsoft is effectively squatting on the 0x400000xx space with their definition. As written, it's not even clear that it will remain consistent between *their own* hypervisors, even less anyone else's. -hpa --
What it means is that their hypervisor returns the interface signature (i.e. "Hv#1"), and that defines the interface. If we use "Lv_1", for example, we can define the interface 0x40000002 through 0x400000FF for Linux. Since leaf 0x40000000 and 0x40000001 are separate, we can decouple the hypervisor vendor from the interface it supports. This also allows a hypervisor to support multiple interfaces.
And whether a guest wants to use the interface without checking the vendor id is a different thing. For Linux, we don't want to hardcode the vendor ids in the upstream code, at least for such a generic interface.
So I think we need to modify the proposal:
Hypervisor interface identification Leaf:
Leaf 0x40000001.
This leaf returns the interface signature that the hypervisor implements.
# EAX: "Lv_1" (or something)
# EBX, ECX, EDX: Reserved.
Lv_1 interface Leaves:
Leaf range 0x40000002 - 0x400000FF.
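As an illustration of the interface identification leaf above, here is a minimal sketch of how a guest might compare the EAX value of leaf 0x40000001 against a four-character signature. The helper names are hypothetical, and "Lv_1" is only the placeholder suggested above. The packing order (first character in the low byte) is an assumption, chosen because it is how "Hv#1" corresponds to the value 0x31237648.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Pack four ASCII characters into a 32-bit register value, first
 * character in the least significant byte (assumed byte order).
 */
static uint32_t sig4(const char s[4])
{
	return  (uint32_t)(uint8_t)s[0]        |
	       ((uint32_t)(uint8_t)s[1] << 8)  |
	       ((uint32_t)(uint8_t)s[2] << 16) |
	       ((uint32_t)(uint8_t)s[3] << 24);
}

/*
 * Hypothetical check: does the EAX value read from leaf 0x40000001
 * match the given interface signature?
 */
static bool interface_is(uint32_t leaf_40000001_eax, const char sig[4])
{
	return leaf_40000001_eax == sig4(sig);
}
```

A guest would only go on to interpret leaves 0x40000002 and up after such a check succeeds for a signature it knows.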
.
Jun Nakajima | Intel Open Source Technology Center
Wrong. This isn't a two-way interface. It's a one-way interface, and it *SHOULD BE*; exposing different information depending on what is running ...
No, it hasn't "clarified my concern" in any way. It's exactly *underscoring* it. In other words, I consider 0x400000xx unusable for anything that is standards-based. The interfaces everyone is currently using aren't designed to export multiple interfaces; they're designed to tell the guest which *one* interface is exported. That is fine, we just need to go elsewhere. -hpa --
What's the significance of supporting multiple interfaces to the same guest simultaneously, i.e. at _runtime_? We don't want the guests to run on such a literally Frankenstein machine. And practically, such testing/debugging would be good only for Halloween :-).
The interface space can be distinct, but the contents are defined and implemented independently, thus you might find overlaps, inconsistency, etc. among the interfaces. And why is runtime "multiple interfaces" required for a standards-based interface?
.
Jun Nakajima | Intel Open Source Technology Center
Yes, and for the reasons outlined in a previous post in this thread, this is an incredibly bad idea. We already hate the guts of the ACPI ...
By that notion, EVERY CPU currently shipped is a "Frankenstein" CPU, since at very least they export Intel-derived and AMD-derived ...
That is the whole point -- without a central coordinating authority, you're going to have to accommodate many definition sources. Otherwise, you're just back to where we started -- each hypervisor exports an interface and that's just that. If there are multiple interface specifications, they should be exported simultaneously in non-conflicting numberspaces, and the *GUEST* gets to choose what to believe. We already do this for *all kinds* of information, including CPUID. It's the right thing to do. -hpa --
The big difference here is that you could create a VM at runtime (by combining the existing interfaces) that did not exist before (or was not tested before). For example, a hypervisor could show hyper-v, osx-v (if any), linux-v, etc., and a guest could create a VM with hyper-v MMU, osx-v interrupt handling, Linux-v timer, etc. And such combinations/variations can grow exponentially.
.
Jun Nakajima | Intel Open Source Technology Center
The guest chooses what it wants to use. We already do this: for example, we use CPUID leaf 0x80000006 preferentially to CPUID leaf 2, simply because it is a better interface. And you're absolutely right that the guest may end up picking and choosing different parts of the interfaces. That's how it is supposed to work. -hpa --
No, that would be a horrible, horrible mistake. There's no sane way to
implement that; it would mean that the hypervisor would have to have
some kind of state model that incorporates all the ABIs in a consistent
way. Any guest using multiple ABIs would effectively end up being
dependent on a particular hypervisor via a frankensteinian interface
that no other hypervisor would implement in the same way, even if they
claim to implement the same set of interfaces.
If the hypervisor just needs to deal with one at a time then it can have
relatively simple ABI<->internal state translation.
However, if you have the notion of hypervisor-agnostic or common
interfaces, then you can include those as part of the rest of the ABI
and make it sane (so Xen+common, hyperv+common, etc).
J
--
It depends on what classes of interfaces you're talking about. I think you and Jun have a somewhat narrow definition of "ABI" in this context. This is functionally equivalent to hardware interfaces (after all, that is what the hypervisor ABI *is* as far as the kernel is concerned) -- no one expects, say, a SATA controller that can run in legacy IDE mode to also take AHCI commands at the same time, but the kernel *does* expect that a chipset which exports LAPIC, HPET, PMTMR and TSC clock sources can use all four at the same time. In the latter case the interfaces are inherently independent and refer to different chunks of hardware which just happen to be related in that they all are related to timing. In the former case, we're dealing with *one* piece of hardware which can operate in one of two modes. For hypervisors, you will end up with cases where you have both types -- for example, KVM will happily use VMware's video interface, but that doesn't mean KVM wants to use VMware's interfaces for storage. This is exactly how it should be: the extent to which this kind of mix and match is possible is a matter of the definition of the individual interfaces themselves, not of the overall architecture. -hpa --
Right, that's what I've been suggesting. I think hypervisors should
be able to offer multiple ABIs to guests, but a guest has to commit to
using one exclusively (ie, once they start to use one then the others
turn themselves off, kill the domain, etc).
J
--
Not necessarily, although the example above is extreme. Redundant ...
Not inherently. Of course, there may be interfaces which are inherently or by policy mutually exclusive, but a hypervisor should only export the interfaces it wants a guest to be able to use. This is particularly so with CPUID, which is a *data export* interface; it doesn't perform any action. -hpa --
Sure. A common feature across all hypervisor-specific ABIs may get
subsumed into a generic interface which is equivalent to all the
others. That's fine. But nobody should expect to be able to mix
hyperV's lazy tlb interface with KVM's pv mmu updates and expect to get
It should export any interface that it implements fully, but those
interfaces may have contradictory or inconsistent semantics which
Well, sure. There's two distinct issues:
1. Using cpuid to get information about the kernel's environment. If
the environment is sane, then cpuid is a read-only, side-effect
free way of getting information, and any information gathered is
fair game.
2. One of the pieces of information you can get with cpuid is a
discovery of what paravirtual hypercall interfaces the environment
supports, which the guest can compare against its list of
interfaces that it supports. If there's some amount of
intersection, it can decide to use one of those interfaces.
I'm saying that *in general* a guest should expect to be able to use one
and only one of those interfaces. There will be explicitly defined
exceptions to that - such as using generic ABIs in addition to
hypervisor specific ABIs - but a guest can't expect to be able to mix
and match.
A tricky issue with selecting an ABI is if two hypervisors end up using
exactly the same mechanism for implementing hypercalls (or whatever), so
that there needs to be some explicit way for the guest to nominate which
interface it's actually using...
J
--
If you can only expose one interface, you need to have the user choose. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
I also observe that your proposal provides no means of positive identification, i.e. that a hypervisor actually conforms to your proposal. -hpa --
No, we're not getting anywhere. This is an outright broken idea. The
space is too small to be able to chop up in this way, and the number of
vendors too large to be able to do it without having a central oversight.
The only way this can work is by having explicit positive identification
of each group of leaves with a signature. If there's a recognizable
signature, then you can inspect the rest of the group; if not, then you
can't. That way, you can avoid any leaf usage which doesn't conform to
this model, and you can also simultaneously support multiple hypervisor
ABIs. It also accommodates existing hypervisor use of this leaf space,
even if they currently use a fixed location within it.
A concrete counter-proposal:
The space 0x40000000-0x400000ff is reserved for hypervisor usage.
This region is divided into 16 16-leaf blocks. Each block has the
structure:
0x400000x0:
eax: max used leaf within the leaf block (max 0x400000xf)
e[bcd]x: leaf block signature. This may be a hypervisor-specific
signature, or a generic signature, depending on the contents of the block
A guest may search for any supported Hypervisor ABIs by inspecting each
leaf at 0x400000x0 for a known signature, and then may choose its mode
of operation accordingly. It must ignore any unknown signatures, and
not touch any of the leaves within an unknown leaf block.
Hypervisor vendors who want to add a hypervisor-specific leaf block must
choose a signature which is recognizably related to their or their
hypervisor's name.
Signatures starting with "Generic" are reserved for generic leaf blocks.
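The block layout in this counter-proposal could be scanned roughly as follows. This is only a sketch under stated assumptions: it works on a snapshot of leaves 0x40000000-0x400000ff rather than executing CPUID, the struct and function names are hypothetical, the signature is assumed to be 12 bytes laid out across EBX, ECX, EDX in that order on a little-endian machine, and the range check on EAX is an added sanity check rather than part of the proposal text.

```c
#include <stdint.h>
#include <string.h>

#define HV_LEAF_BASE   0x40000000u
#define HV_BLOCK_COUNT 16u	/* 16 blocks ... */
#define HV_BLOCK_SIZE  16u	/* ... of 16 leaves each */

struct cpuid_regs { uint32_t eax, ebx, ecx, edx; };

/*
 * `space` is a snapshot of leaves 0x40000000..0x400000ff, indexed by
 * (leaf - HV_LEAF_BASE). Returns the base leaf of the first block whose
 * header leaf (0x400000x0) carries `sig` in EBX:ECX:EDX, or 0 if no
 * block matches. Blocks with unknown signatures are never inspected
 * beyond their header leaf.
 */
static uint32_t find_leaf_block(const struct cpuid_regs space[256],
				const char sig[12])
{
	for (uint32_t b = 0; b < HV_BLOCK_COUNT; b++) {
		const struct cpuid_regs *hdr = &space[b * HV_BLOCK_SIZE];
		uint32_t base = HV_LEAF_BASE + b * HV_BLOCK_SIZE;
		uint32_t regs[3] = { hdr->ebx, hdr->ecx, hdr->edx };

		/* EAX must name a leaf inside this block (max 0x400000xf). */
		if (hdr->eax < base || hdr->eax > base + HV_BLOCK_SIZE - 1)
			continue;
		if (memcmp(regs, sig, 12) == 0)
			return base;
	}
	return 0;
}
```

A guest would call this once per signature it understands, and leave every other block alone.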
A guest may scan leaf blocks to enumerate what hypervisor ABIs/hypercall
interfaces are available to it. It may mix and match any information
from leaves it understands. However, once it starts using a specific
hypervisor ABI by making hypercalls or doing other operations with
side-effects, it must commit to using that ABI exclusively (a specific
hypervisor ABI may include ...
I suspect we can get a larger number space if we ask Intel & AMD. In fact, I think we should request that the entire 0x40xxxxxx numberspace is assigned to virtualization *anyway*. -hpa --
Yes, that would be good. In that case I'd revise my proposal to make
each leaf block 256 leaves instead of 16. But it still needs to be a
proper enumeration with signatures, rather than assigning fixed points
in that space to specific interfaces.
J
--
With a sufficiently large block, we could use fixed points, e.g. by having each vendor create interfaces in the 0x40SSSSXX range, where SSSS is the PCI ID they use for PCI devices. Note that I said "create interfaces". It's important that all this is about who specified the interface -- for "what hypervisor is this" just use 0x40000000 and disambiguate based on that. -hpa --
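To make the 0x40SSSSXX numbering concrete, here is a small sketch of the arithmetic, assuming SSSS is a 16-bit PCI vendor ID and XX an 8-bit leaf index within that vendor's range. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Build a CPUID leaf number of the form 0x40SSSSXX from a 16-bit PCI
 * vendor ID (SSSS) and an 8-bit leaf index (XX).
 */
static uint32_t vendor_leaf(uint16_t pci_vendor_id, uint8_t index)
{
	return UINT32_C(0x40000000) | ((uint32_t)pci_vendor_id << 8) | index;
}
```

For example, with VMware's PCI vendor ID 0x15ad the range would start at 0x4015ad00, and with the Xen vendor ID 0x5853 at 0x40585300. A vendor ID of zero maps back onto the existing 0x400000xx range.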
Sure, you could do that, but you'd still want to have a signature in
0x40SSSS00 to positively identify the chunk. And what if you wanted
"What hypervisor is this?" isn't a very interesting question; if you're
even asking it then it suggests that something has gone wrong. It's much
more useful to ask "what interfaces does this hypervisor support?", and
enumerating a smallish range of well-known leaves looking for signatures
is the simplest way to do that. (We could use signatures derived from
the PCI vendor IDs which would help with managing that namespace.)
J
--
What you'd want, at least, is a standard CPUID identification and range leaf at the top. 256 leaves is a *lot*, though; I'm not saying one couldn't run out, but it'd be hard. Keep in mind that for large objects there are "counting" CPUID levels, as much as I personally dislike them, and one could easily argue that if you're doing something that would require anywhere near 256 leaves you probably are storing bulk data that belongs elsewhere. Of course, if we had some kind of central authority assigning 8-bit IDs that would be even better, especially since there are tools in the field which already scan on 64K boundaries. I don't know, though, how likely ...
I agree completely, of course (except that "what hypervisor is this" still has limited usage, especially when it comes to dealing with bug workarounds. Similar to the way we use CPU vendor IDs and stepping numbers for physical CPUs.) -hpa --
I'm assuming that the likelihood of getting all possible vendors -
current and future - to agree to a scheme like this is pretty small. We
need to come up with something that will work well when there are
I guess. It's certainly useful to be able to identify the hypervisor for
bug reporting and just general status information. But making
functional changes on that basis should be a last resort.
J
--
It's essentially already happening. Everyone wants to be a better hyperv than hyperv ;-) --
I see the following issues with this proposal:
1. Kernel complexity: Just thinking about the complexity that this will put in the kernel to handle these multiple ABI signatures and scanning all of these leaf blocks is difficult to digest.
2. Divergence in the interface provided by the hypervisors: The reason we brought up a flat hierarchy is because we think we should be moving towards an approach where the guest code doesn't diverge too much when running under different hypervisors. That is, the guest essentially does the same thing if it's running on, say, Xen or VMware. This design, IMO, will take us a step backward to what we already have seen with paravirt ops. Each hypervisor (mostly) defines its own cpuid block, and the guest correspondingly needs to have code to handle each of these cpuid blocks, with these blocks mostly being exclusive.
3. Is there a need to do all this over-engineering: Aren't we over-engineering a simple interface here? The point is, there are right now 256 cpuid leaves; do we realistically think we are ever going to exhaust all of them? We are really surprised to know that people may think this space is too small. It would be interesting to know what uses you might want to put cpuid to.
Thanks, --
What's wrong with what we have in paravirt_ops? Just agreeing on CPUID doesn't help very much. You still need a mechanism for doing hypercalls to implement anything meaningful. We aren't going to agree on a hypercall mechanism. KVM uses direct hypercall instructions, Xen uses a hypercall page, VMware uses VMI, Hyper-V uses MSR writes. We all have already defined the hypercall namespace in a certain way. We've already gone down the road of trying to make standard paravirtual interfaces (via virtio). No one was sufficiently interested in collaborating. I don't see why other paravirtualizations are going to be much different. Regards, Anthony Liguori --
The point is to be able to support those interfaces. Presently a Linux guest will test and find out which HV it's running on, and adapt. Another guest will fail to enlighten itself, and perf will suffer...yadda, yadda. thanks, -chris --
Agreeing on CPUID does not get us close at all to having shared interfaces for paravirtualization. As I said in another note, there are more fundamental things that we differ on (like hypercall mechanism) that's going to make that challenging. We already are sharing code, when appropriate (see the Xen/KVM PV clock interface). Regards, --
Your explanation below answers the question you raised, the problem being we need to have support for each of these different hypercall mechanisms in the kernel. I understand that this was the correct thing to do at that moment. But do we want to go the same way again for CPUID when we can make it generic (flat enough) for anybody to use it in the same manner and Thanks, Alok --
But what sort of information can be stored in cpuid that's actually useful? Right now we just use it in KVM for feature bits. Most of the stuff that's interesting is stored in shared memory because a guest can read that without taking a vmexit or via a hypercall. We can all agree upon a common mechanism for doing something but if no one is using that mechanism to do anything significant, what purpose does it serve? Regards, Anthony Liguori --
The scanning for the signatures is trivial; it's not a significant
amount of code. Actually implementing them is a different matter, but
that's the same regardless of where they are placed or how they're
discovered. After discovery it's the same either way: there's a leaf
I guess, but the bulk of the uses of this stuff are going to be
hypervisor-specific. You're hard-pressed to come up with any other
generic uses beyond tsc. In general, if a hypervisor is going to put
something in a special cpuid leaf, it's because there's no other good way
to represent it. Generic things are generally going to appear as an
emulated piece of the virtualized platform, in ACPI, DMI, a
Look, if you want to propose a way to use that cpuid space in a
reasonably flexible way that allows it to be used as the need arises,
then we can talk about it. But I think your proposal is a poor way to
achieve those ends.
If you want blessing for something that you've already implemented and
shipped, well, you don't need anyone's blessing for that.
J
--
And arguably, storing TSC frequency in CPUID is a terrible interface because the TSC frequency can change any time a guest is entered. It really should be a shared memory area so that a guest doesn't have to vmexit to read it (like it is with the Xen/KVM paravirt clock). Regards, --
True for older hardware, newer hardware should fix this. I guess the point is, there are numbers that are easy to measure incorrectly in a guest. Doesn't justify the whole thing. --
It's not fixed for newer hardware. Larger systems still have multiple tsc frequencies. -- error compiling committee.c: too many arguments to function --
It's not terrible, it's actually brilliant. TSC is part of the processor architecture, the processor should have a way to tell us what speed it is. Having a TSC with no interface to determine the frequency is a terrible design flaw. This is what caused the problem in the first place. And now we're trying to fiddle around with software wizardry for what should be done in hardware in the first place. Once again, para-virtualization is basically useless. We can't agree on a solution without over-designing some complex system with interface signatures and multi-vendor cooperation and nonsense. Solve the non-virtualized problem and the virtualized problem goes away. Jun, you work at Intel. Can you ask for a new architecturally defined MSR that returns the TSC frequency? Not a virtualization specific MSR. A real MSR that would exist on physical processors. The TSC started as an MSR anyway. There should be another MSR that tells the frequency. If it's hard to do in hardware, it can be a write-once MSR that gets initialized by the BIOS. It's really a very simple solution to a very common problem. Other MSRs are dedicated to bus speed and so on, this seems remarkably similar. Once the physical problem is solved, the virtualized problem doesn't even exist. We simply add support for the newly defined MSR and voilà. Other chipmakers probably agree it's a good idea and go along with it too, and in the meantime, reading a non-existent MSR is a fairly harmlessly handled #GP. I realize it's the wrong thing for us now, but long term, it's the only architecturally 'correct' approach. You can even extend it to have visible TSC frequency changes clocked via performance counter events (and then get interrupts on those events if you so wish), solving the dynamic problem too. Paravirtualization is a symptom of an architectural problem. We should always be trying to fix the architecture first. Zach --
It does. 1 tick == 1 tick. The processor doesn't have a concept of wall clock time so wall clock units don't make much sense. If it did, I'd say, screw the TSC, just give me a ns granular time stamp and let's ...
rdtscp sort of gives you this. But still, just give me my rdnsc and ...
So a solution is needed that works for now. Anything that requires a vmexit is bad because the TSC frequency can change quite often. Even if you ignore the troubles with frequency scaling on older processors and VCPU migration across NUMA nodes, there will be a very visible change in TSC frequency after a live migration. So there are two possible solutions. Have a shared memory area that the guest can consult that has the latest TSC frequency (this is what KVM and Xen do) or have some sort of interrupt mechanism that notifies the guest when the TSC frequency changes, after which software can do something that vmexits to get the TSC frequency. The proposed solution doesn't include a TSC frequency change notification mechanism. This is part of the problem with this sort of approach to standardization. It's hard to come up with the best interface at first. You have to try a couple ways, and then everyone can eventually standardize on the best one if one ever emerges. Regards, --
Ah, if it was only that simple. Transmeta actually did this, but it's not as useful as you think. There are at least three crystals in modern PCs: one at 32.768 kHz (for the RTC), one at 14.31818 MHz (PIT, PMTMR and HPET), and one at a higher frequency (often 200 MHz.) All the main data distribution clocks in the system are derived from the third, which is subject to spread-spectrum modulation due to RFI concerns. Therefore, relying on the *nominal* frequency of this clock is vastly incorrect; often by as much as 2%. Spread-spectrum modulation is supposed to vary around zero enough that the spreading averages out, but the only way to know what the center frequency actually is is to average. Furthermore, this high-frequency clock is generally not calibrated anywhere near as well as the 14 MHz clock; in good designs the 14 MHz is actually a TCXO (temperature compensated crystal oscillator), which is accurate to something like ±2 ppm. -hpa --
For what it's worth, Transmeta's implementation used CPUID leaf 0x80860001.ECX to give the TSC frequency rounded to the nearest MHz. The caveat of spread-spectrum modulation applies. -hpa --
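As a rough worked example of what "rounded to the nearest MHz" costs in precision: the worst-case rounding error is half a MHz, so the relative error in parts per million is 500000 divided by the frequency in MHz. The sketch below just does that arithmetic; the function name is made up, and it says nothing about how the value is read from the CPU.

```c
#include <stdint.h>

/*
 * Worst-case relative error, in parts per million, of a TSC frequency
 * that has been rounded to the nearest MHz:
 *   0.5 MHz / f MHz * 1e6 = 500000 / f   (integer division, rounds down)
 */
static uint32_t mhz_rounding_error_ppm(uint32_t tsc_mhz)
{
	if (tsc_mhz == 0)
		return 0;	/* no meaningful answer for a zero frequency */
	return 500000u / tsc_mhz;
}
```

At 1000 MHz this gives up to 500 ppm, and at 2000 MHz up to 250 ppm, which is far coarser than the roughly 2 ppm quoted earlier for a good 14 MHz TCXO.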
I'm not suggesting using the nominal value. I'm suggesting the measurement be done in the one and only place where there is perfect control of the system, the processor boot-strapping in the BIOS. Only the platform designers themselves know the speed of the oscillator which is modulating the clock and so only they should be calibrating the speed of the TSC. If this modulation really does alter the frequency by +/- 2% (seems high to me, but hey, I don't design motherboards), using an LFO, then basically all the calibration done in Linux is broken and has been for some time. You can't calibrate only once, or risk being off by 2%, you can't calibrate repeatedly and take the fastest estimate, or you are off by 2%, and you can't calibrate repeatedly and take the average without risking SMI noise affecting the lowest clock speed measurement, contributing unknown error. Hmm. Re-reading your e-mail, I see you are saying the nominal frequency may be off by 2% (and I easily believe that), not necessarily that the frequency modulation may be 2% (which I still think is high). Does anyone know what the actual bounds on spread spectrum modulation are or how fast the clock is modulated? Zach --
No. *No one*, including the manufacturers, knows the speed of the oscillator which is modulating the clock. What you have to do is average over a timespan which is long enough that the SSM averages out (a relatively small fraction of a second.) As for trusting the BIOS on this, that's a total joke. Firmware vendors ...
You have to calibrate over a sample interval long enough that the SSM ...
No, I'm saying the frequency modulation may be up to 2%. Typically it is something like [-2%,+0%]. -hpa --
