In certain use-cases, we want to allocate guests fixed time slices where idle
guest cycles leave the machine idling.  There are many approaches to achieve
this but the most direct is to simply avoid trapping the HLT instruction which
lets the guest directly execute the instruction putting the processor to sleep.

Introduce this as a module-level option for kvm-vmx.ko since if you do this
for one guest, you probably want to do it for all.  A similar option is possible
for AMD but I don't have easy access to AMD test hardware.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
---
v1 -> v2
 - Rename parameter to yield_on_hlt
 - Remove __read_mostly

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index caa967e..d8310e4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO);
 static int __read_mostly vmm_exclusive = 1;
 module_param(vmm_exclusive, bool, S_IRUGO);
 
+static int yield_on_hlt = 1;
+module_param(yield_on_hlt, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST				\
 	(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK						\
@@ -1419,7 +1422,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 				&_pin_based_exec_control) < 0)
 		return -EIO;
 
-	min = CPU_BASED_HLT_EXITING |
+	min =
 #ifdef CONFIG_X86_64
 	      CPU_BASED_CR8_LOAD_EXITING |
 	      CPU_BASED_CR8_STORE_EXITING |
@@ -1432,6 +1435,10 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	      CPU_BASED_MWAIT_EXITING |
 	      CPU_BASED_MONITOR_EXITING |
 	      CPU_BASED_INVLPG_EXITING;
+
+	if (yield_on_hlt)
+		min |= CPU_BASED_HLT_EXITING;
+
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |
 	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
-- 
1.7.0.4
--
In certain use-cases, we want to allocate guests fixed time slices where idle guest cycles leave the machine idling.

I don't understand why this is needed. Could you explain in more detail? Thanks. --
If you run 4 guests on a CPU, and they're all trying to consume 100% CPU, all things being equal, you'll get ~25% CPU for each guest. However, if one guest is idle, you'll get something like 1% 32% 33% 32%. This characteristic is usually desirable because it increases aggregate throughput but in some circumstances, determinism is more desirable than aggregate throughput.

This patch essentially makes guest execution non-work conserving by making it appear to the scheduler that each guest wants 100% CPU even though they may be idling. That means that regardless of what each guest is doing, if you have four guests on one CPU, each will get ~25% CPU[1].

[1] there are corner cases around things like forced sleep due to PFs and the like. The goal is not 100% determinism but to at least obtain significantly more determinism than we have now.

Regards, --
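The difference between the two behaviours can be illustrated with a small user-space sketch. It is only a model, not scheduler code: `work_conserving_shares()` and `fixed_slice_shares()` are hypothetical helpers, demands are expressed as a percentage of one CPU, and the work-conserving case is approximated by a max-min fair split.

```c
#include <assert.h>
#include <stddef.h>

#define CPU_CAPACITY 100.0	/* one physical CPU, in percent */

/*
 * Work-conserving model: a guest that wants less than its fair share gets
 * what it asks for, and the leftover is split among the guests that still
 * want more.  demand[] and share[] are percentages of one CPU.
 */
void work_conserving_shares(const double *demand, double *share, size_t n)
{
	double remaining = CPU_CAPACITY;
	size_t unsettled = n;
	size_t i;
	int progress = 1;

	for (i = 0; i < n; i++) {
		assert(demand[i] >= 0.0);
		share[i] = -1.0;	/* negative marks "not settled yet" */
	}

	while (unsettled > 0 && progress) {
		double fair = remaining / (double)unsettled;

		progress = 0;
		for (i = 0; i < n; i++) {
			if (share[i] < 0.0 && demand[i] <= fair) {
				share[i] = demand[i];
				remaining -= demand[i];
				unsettled--;
				progress = 1;
			}
		}
	}

	/* Guests that want more than the fair share split what is left. */
	for (i = 0; i < n; i++)
		if (share[i] < 0.0)
			share[i] = remaining / (double)unsettled;
}

/*
 * Non-work-conserving model: every guest looks 100% busy to the scheduler,
 * so each one receives an equal slice regardless of what it is doing.
 */
void fixed_slice_shares(double *share, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		share[i] = CPU_CAPACITY / (double)n;
}
```

With demands of {1, 100, 100, 100} the first helper settles on roughly 1/33/33/33, while the second always hands out 25 each.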
What if one of the guests crashes qemu or invokes a powerdown? Suddenly the others get 33% each (with 1% going to my secret round-up account). Doesn't seem like a reliable way to limit cpu. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
Some monitoring tool will need to catch that event and spawn a "dummy" VM to consume 25% cpu, bringing everyone's use back to 25% as before. That's admittedly not neat, but that's what we are thinking of atm in the absence of a better solution to the problem (e.g. a kernel scheduler supporting hard limits). - vatsa --
Breaks async PF (see "checks on guest state"), timer reinjection probably. It should be possible to achieve determinism with a scheduler policy? --
Timer reinjection will continue to work as expected. If a guest is halted and an external interrupt is delivered (by a timer), the guest will still exit as expected.

I can't think of anything that would be functionally correct and still depend on getting hlt exits because ultimately, a guest never actually

If the ultimate desire is to have the guests be scheduled in a non-work conserving fashion, I can't see a more direct approach than to simply not have the guests yield (which is ultimately what hlt trapping does).

Anything the scheduler would do is after the fact and probably based on inference about why the yield occurred.

Regards, --
A VCPU in HLT state only allows injection of certain events that would be delivered on HLT. #PF is not one of them.

You'd have to handle this situation on event injection, vmentry fails

LAPIC pending timer events will be reinjected on entry path, if accumulated. So they depend on any exit. If you disable HLT-exiting,

Another issue is you ignore the host's idea of the best way to sleep (ACPI, or whatever).

And handling inactive HLT state (which was never enabled) can be painful. --
But you can't inject an exception into a guest while the VMCS is active, can you? So the guest takes an exit while in the hlt instruction but that's no different than if the guest has been interrupted because of

So this works today because on a hlt exit, emulate_halt() will clear the HLT state which then puts the vcpu into a state where it can receive an exception injection?

Regards, --
hlt exiting doesn't leave vcpu in the halted state (since hlt has not

The halt state is never entered. --
Async PF completion does not kick the vcpu out of guest mode. It wakes the vcpu only if it is waiting on a waitqueue. It was done to not generate

-- Gleb. --
Perhaps it should be a VM level option. And then invert the notion. Create one idle domain w/out hlt trap. Give that VM a vcpu per pcpu (pin in place probably). And have that VM do nothing other than hlt. Then it's always runnable according to scheduler, and can "consume" the extra work that CFS wants to give away. What do you think? thanks, -chris --
That's an interesting idea. I think Vatsa had some ideas about how to do this with existing mechanisms.

I'm interested in comparing behavior with fixed allocation because one thing the above relies upon is that the filler VM loses its time when one of the non-filler VCPUs needs to run. This may all work correctly but I think it's easier to reason about having each non-filler VCPU have a fixed (long) time slice. If a VCPU needs to wake up to become non-idle, it can do so immediately because it already has the PCPU.

Regards, --
The flipside...don't have to worry about the issues that Marcelo brought up. Should be pretty easy to compare though. thanks, -chris --
Consuming the timeslice outside guest mode is less intrusive and easier
to replace. Something like this should work?

	if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
		/* idle on the host until the scheduler wants the CPU back */
		while (!need_resched())
			default_idle();
	}

But you agree this is no KVM business.
--
Like non-trapping hlt, that too will guarantee that the guest is preempted by timeslice exhaustion (and is simpler than non-trapping hlt). So it may well be the simplest for the case where we are perfectly committed (i.e. the vcpu fractional core count totals the pcpu count). But once we are undercommitted we still need some extra logic to handle the hard cap and something to kick the running guest off the cpu and suck up the extra cycles in a power conserving way. thanks, -chris --
I'm not entirely sure TBH. If you think of a cloud's per-VCPU capacity in terms of Compute Units, having a model where a VCPU maps to 1-3 units depending on total load is potentially interesting particularly if the VCPU's capacity only changes in discrete amounts, that the expected capacity is communicated to the guest, and that the capacity only changes periodically. Regards, --
OK, let's say a single PCPU == 12 Compute Units. If the guest is the first to migrate to a newly added unused host, and we are using either non-trapping hlt or Marcelo's non-yielding trapping hlt, then that guest is going to get more CPU than it expected unless there is some throttling mechanism. Specifically, it will get 12CU instead of 1-3CU. Do you agree with that? thanks, -chris --
Yes. There's definitely a use-case to have a hard cap. But I think another common use-case is really just performance isolation. If over the course of a day, you go from 12CU, to 6CU, to 4CU, that might not be that bad of a thing.

If the environment is designed correctly, of N nodes, N-1 will always be at capacity so it's really just a single node that is underutilized.

Regards, --
OK, good, just wanted to be clear. Because this started as a discussion of hard caps, and it began to sound as if you were no longer advocating

I guess it depends on your SLA. We don't have to do anything to give varying CU based on host load. That's the one thing CFS will do for

Many clouds do a variation on Small, Medium, Large sizing. So depending on the scheduler (best fit, rr...) even the notion of at capacity may change from node to node and during the time of day.

thanks, -chris --
I'm really anticipating things like the EC2 micro instance where the CPU allotment is variable. Variable allotments are interesting from a density perspective but having interdependent performance is definitely a problem.

Another way to think about it: a customer reports a performance problem at 1PM. With non-yielding guests, you can look at logs and see that the expected capacity was 2CU (it may have changed to 4CU at 3PM). However, without something like non-yielding guests, the performance is almost entirely unpredictable and unless you have an exact timestamp from the customer along with a fine granularity performance log, there's no way

An ideal cloud will make sure that something like 4 Small == 2 Medium == 1 Large instance and that the machine capacity is always a multiple of Large instance size. With a division like this, you can always achieve maximum density provided that you can support live migration.

Regards, --
My initial inclination is that this would be inappropriate for KVM but I think I'm slowly convincing myself otherwise.

Ultimately, hard limits and deterministic scheduling are related goals but not quite the same. A hard limit is very specific: you want to receive no more than an exact amount of CPU time per VCPU. With deterministic scheduling, you want to make sure that a set of VMs are not influenced by each other's behavior.

You want hard limits when you want to hide the density/capacity of a node from the end customer. You want determinism when you simply want to isolate the performance of each customer from the other customers. That is, the only thing that should affect the performance graph of a VM is how many neighbors it has (which is controlled by management software) rather than what its neighbors are doing.

If you have hard limits, you can approximate deterministic scheduling but it's complex in the face of changing numbers of guests. Additionally, hard limits present issues with directed yield that don't exist with a deterministic scheduling approach. You can still donate your time slice to another VCPU because the VCPUs are not actually capped. That may mean that an individual VCPU gets more PCPU time than an exact division but for the VM overall, it won't get more than its total share. So the principle of performance isolation for the guest isn't impacted.

Regards, Anthony Liguori --
What's the difference between this and the Linux idle threads? --
If we have 3 VMs and want to give them 25% each of a CPU, then having just the idle thread would end up giving them 33% each. One way of achieving a 25% rate limit is to create a "dummy" or "filler" VM, and let it compete for the resource, thus rate-limiting everyone to 25% in this case.

Essentially we are tackling the rate-limit problem by creating additional "filler" VMs/threads that will compete for the resource, thus keeping in check how much cpu resource is consumed by "real" VMs. Admittedly not as neat as having in-kernel support for rate limiting.

- vatsa --
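The filler arithmetic can be sketched as below. `fillers_needed()` is a hypothetical helper, and it assumes that every real guest and every filler has equal weight and is always runnable; once a real guest idles, its cycles are shared by everything else that is runnable and the target no longer holds.

```c
#include <assert.h>

/*
 * Number of always-runnable filler VMs needed so that each of `real_guests`
 * equal-weight, always-runnable guests is held to `target_pct` percent of
 * one CPU.  Returns -1 when the target cannot be reached with whole, equal
 * shares, or when the real guests alone already push shares below it.
 */
int fillers_needed(int target_pct, int real_guests)
{
	int slots;

	assert(target_pct > 0 && target_pct <= 100);
	assert(real_guests >= 0);

	if (100 % target_pct != 0)
		return -1;		/* e.g. 30% is not 100/N for a whole N */

	slots = 100 / target_pct;	/* equal competitors needed in total */
	if (real_guests > slots)
		return -1;

	return slots - real_guests;
}
```

For the case above, `fillers_needed(25, 3)` returns 1; with four real guests it returns 0.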
That's not sufficient. Let's say we have 3 guests A, B, C that need to be rate limited to 25% on a single cpu system. We create this idle guest D that is a 100% cpu hog as per the above definition. Now when one of the guests (say A) is idle, what ensures that the idle cycles of A are given only to D and not partly to B/C? - vatsa --
To tackle this problem, I was thinking of having a fill-thread associated with each vcpu (i.e both belong to same cgroup). Fill-thread consumes idle cycles left by vcpu, but otherwise doesn't compete with it for cycles. - vatsa --
That's what Marcelo's suggestion does w/out a fill thread. --
Are we willing to add that to KVM sources? I was working under the constraint of not modifying the kernel (especially avoiding short term hacks that become unnecessary in the longer run, in this case when kernel-based hard limits go in). - vatsa --
There's one complication though even with that. How do we compute the real utilization of VM (given that it will appear to be burning 100% cycles)? We need to have scheduler discount the cycles burnt post halt-exit, so more stuff is needed than those simple 3-4 lines! - vatsa --
Heh, was just about to say the same thing ;) --
My first reaction is that it's not terribly important to account the non-idle time in the guest because of the use-case for this model. Eventually, it might be nice to have idle time accounting but I don't see it as a critical feature here. Non-idle time simply isn't as meaningful here as it normally would be. If you have 10 VMs in a normal environment and saw that you had only 50% CPU utilization, you might be inclined to add more VMs. But if you're offering deterministic execution, it doesn't matter if you only have "50%" utilization. If you add another VM, the guests will get exactly the same impact as if they were using 100% utilization. Regards, --
Agreed ...but I was considering the larger user-base who may be surprised to see their VMs being reported as 100% hogs when they had left it idle. - vatsa --
Depends on the chargeback model. This would put guest vcpu runtime vs
host running guest vcpu time really out of skew. ('course w/out steal
and that time it's already out of skew). But I think most models are

Who is "you"? cloud user, or cloud service provider's scheduler?
On the user side, 50% cpu utilization wouldn't trigger me to add new
VMs. On the host side, 50% cpu utilization would have to be measure

Sorry, didn't follow here?

thanks,
-chris
--
Right. I'm not familiar with any models that are actually based on CPU-consumption based accounting. In general, the feedback I've received is that predictable accounting is pretty critical so I don't anticipate something as volatile as CPU-consumption ever being something

The question is, why would something care about host CPU utilization? The answer I can think of is, something wants to measure host CPU utilization to identify an underutilized node. Once the underutilized node is identified, more work can be given to it.

Adding more work to an underutilized node doesn't change the amount of work that can be done. More concretely, one PCPU, four independent VCPUs. They are consuming 25%, 25%, 25%, and 12% respectively. My management software says, ah hah, I can stick a fifth VCPU on this box that's only using 5%. The other VCPUs are unaffected.

However, in a no-yield-on-hlt model, if I have four VCPUs, they each get 25%, 25%, 25%, 25% on the host. Three of the VCPUs are running 100% in the guest and one is running 50%.

If I add a fifth VCPU, even if it's only using 5%, each VCPU drops to 20%. That means the three VCPUs that are consuming 100% now see their share fall from 25% to 20% of the PCPU, so the same work takes 25% longer, even though you've added an idle guest.

Basically, the traditional view of density simply doesn't apply in this model.

Regards, --
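The no-yield-on-hlt numbers can be checked with a short sketch. The helpers below are hypothetical and model only an equal split of one PCPU among VCPUs that never yield.

```c
#include <assert.h>

/* Percent of one PCPU each VCPU receives when no VCPU ever yields. */
double no_yield_share_pct(int nvcpus)
{
	assert(nvcpus > 0);
	return 100.0 / nvcpus;
}

/* Relative loss of throughput, in percent, seen by a CPU-bound VCPU. */
double throughput_drop_pct(int vcpus_before, int vcpus_after)
{
	double old_share = no_yield_share_pct(vcpus_before);
	double new_share = no_yield_share_pct(vcpus_after);

	return (old_share - new_share) / old_share * 100.0;
}

/* Extra wall-clock time, in percent, needed for the same amount of work. */
double runtime_increase_pct(int vcpus_before, int vcpus_after)
{
	double old_share = no_yield_share_pct(vcpus_before);
	double new_share = no_yield_share_pct(vcpus_after);

	return (old_share / new_share - 1.0) * 100.0;
}
```

Going from four to five VCPUs moves each share from 25% to 20% of the PCPU: about a 20% loss of throughput for a CPU-bound guest, or equivalently about 25% more wall-clock time for the same work.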
Probably yes. The point is, you get the same effect as with the non-trapping hlt but without the complications on low-level VMX/SVM code. Even better if you can do it with fill thread idea. --
Well, no. Better to consume hlt time but yield if need_resched or in case of any event which breaks out of kvm_vcpu_block. --
Yeah, I pictured priorities handling this. --
All guests are of equal priority in this case (that's how we are able to divide time into 25% chunks), so unless we dynamically boost D's priority based on how idle the other VMs are, it's not going to be easy! - vatsa --
Right, I think there has to be an external mgmt entity. Because num vcpus is not static. So priorities have to be rebalanced at vcpu create/destroy time. thanks, -chris --
and at idle/non-idle time as well, which makes the mgmt entity's job rather harder? Anyway, if we are willing to take a patch to burn cycles upon halt (as per Marcelo's patch), that'd be the best (short-term) solution ..otherwise, something like a filler-thread per-vcpu is easier than dynamic change of priorities .. - vatsa --
We've actually done a fair amount of testing with using priorities like this. The granularity is extremely poor because priorities don't map linearly to cpu time allotment. The interaction with background tasks also gets extremely complicated. Regards, --
