Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

Previous thread: Possiblle misleading information of motherboards supporting VT-d by Prasad Joshi on Thursday, December 2, 2010 - 5:10 am. (1 message)

Next thread: [PATCH 0/2] Fix reboot on non-preemptible kernels by Avi Kivity on Thursday, December 2, 2010 - 9:00 am. (3 messages)
From: Anthony Liguori
Date: Thursday, December 2, 2010 - 6:59 am

In certain use-cases, we want to allocate guests fixed time slices where idle
guest cycles leave the machine idling.  There are many approaches to achieve
this but the most direct is to simply avoid trapping the HLT instruction which
lets the guest directly execute the instruction putting the processor to sleep.

Introduce this as a module-level option for kvm-vmx.ko since if you do this
for one guest, you probably want to do it for all.  A similar option is possible
for AMD but I don't have easy access to AMD test hardware.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
---
v1 -> v2
 - Rename parameter to yield_on_hlt
 - Remove __read_mostly

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index caa967e..d8310e4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO);
 static int __read_mostly vmm_exclusive = 1;
 module_param(vmm_exclusive, bool, S_IRUGO);
 
+static int yield_on_hlt = 1;
+module_param(yield_on_hlt, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST				\
 	(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK						\
@@ -1419,7 +1422,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 				&_pin_based_exec_control) < 0)
 		return -EIO;
 
-	min = CPU_BASED_HLT_EXITING |
+	min =
 #ifdef CONFIG_X86_64
 	      CPU_BASED_CR8_LOAD_EXITING |
 	      CPU_BASED_CR8_STORE_EXITING |
@@ -1432,6 +1435,10 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	      CPU_BASED_MWAIT_EXITING |
 	      CPU_BASED_MONITOR_EXITING |
 	      CPU_BASED_INVLPG_EXITING;
+
+	if (yield_on_hlt)
+		min |= CPU_BASED_HLT_EXITING;
+
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |
 	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
-- 
1.7.0.4

--

From: lidong chen
Date: Thursday, December 2, 2010 - 7:39 am

In certain use-cases, we want to allocate guests fixed time slices where idle
guest cycles leave the machine idling.

I don't understand why this is needed. Could you explain in more detail?
thanks.


--

From: Anthony Liguori
Date: Thursday, December 2, 2010 - 8:23 am

If you run 4 guests on a CPU, and they're all trying to consume 100% 
CPU, all things being equal, you'll get ~25% CPU for each guest.

However, if one guest is idle, you'll get something like 1% 32% 33% 
32%.  This characteristic is usually desirable because it increases 
aggregate throughput, but in some circumstances determinism is more 
desirable than aggregate throughput.

This patch essentially makes guest execution non-work conserving by 
making it appear to the scheduler that each guest wants 100% CPU even 
though they may be idling.

That means that regardless of what each guest is doing, if you have four 
guests on one CPU, each will get ~25% CPU[1].

[1] there are corner cases around things like forced sleep due to PFs 
and the like.  The goal is not 100% determinism but rather to obtain 
significantly more determinism than we have now.
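
A minimal sketch of the arithmetic above, assuming an idealized fair
scheduler on a single CPU and ignoring the ~1% an idle guest still
consumes.  The function name and its parameters are hypothetical, chosen
only for illustration:

```c
#include <assert.h>

/*
 * Idealized share (in percent) that each busy guest receives on one
 * physical CPU.
 *
 * nr_guests:    guests placed on the CPU
 * nr_idle:      guests currently idle (sitting in HLT)
 * yield_on_hlt: nonzero if HLT traps, so an idle guest yields the CPU
 *
 * Hypothetical helper: no scheduler overhead, equal weights assumed.
 */
double busy_guest_share_pct(int nr_guests, int nr_idle, int yield_on_hlt)
{
	assert(nr_guests > 0 && nr_idle >= 0 && nr_idle < nr_guests);

	/* Work conserving: idle guests yield, busy guests split the CPU. */
	if (yield_on_hlt)
		return 100.0 / (nr_guests - nr_idle);

	/* Non-work conserving: every guest looks runnable all the time. */
	return 100.0 / nr_guests;
}
```

With four guests and one of them idle, this gives roughly 33% per busy
guest when HLT traps and 25% per guest when it does not.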

Regards,


--

From: Avi Kivity
Date: Friday, December 3, 2010 - 2:38 am

What if one of the guests crashes qemu or invokes a powerdown?  Suddenly 
the others get 33% each (with 1% going to my secret round-up account).  
Doesn't seem like a reliable way to limit cpu.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Srivatsa Vaddagiri
Date: Friday, December 3, 2010 - 4:12 am

Some monitoring tool will need to catch that event and spawn a
"dummy" VM to consume 25% cpu, bringing back everyone's use to 25% as before.

That's admittedly not neat, but that's what we are thinking of atm in absence of
a better solution to the problem (ex: kernel scheduler supporting hard-limits).

- vatsa

--

From: Marcelo Tosatti
Date: Thursday, December 2, 2010 - 10:37 am

Breaks async PF (see "checks on guest state"), timer reinjection
probably. It should be possible to achieve determinism with 
a scheduler policy?



--

From: Anthony Liguori
Date: Thursday, December 2, 2010 - 12:07 pm

Timer reinjection will continue to work as expected.  If a guest is 
halted and an external interrupt is delivered (by a timer), the guest 
will still exit as expected.

I can't think of anything that would be functionally correct and still 
depend on getting hlt exits because ultimately, a guest never actually 

If the ultimate desire is to have the guests be scheduled in a 
non-work conserving fashion, I can't see a more direct approach than 
to simply not have the guests yield (which is ultimately what hlt 
trapping does).

Anything the scheduler would do is after the fact and probably based on 
inference about why the guest yielded.

Regards,


--

From: Marcelo Tosatti
Date: Thursday, December 2, 2010 - 1:12 pm

VCPU in HLT state only allows injection of certain events that
would be delivered on HLT. #PF is not one of them.

You'd have to handle this situation on event injection, vmentry fails

LAPIC pending timer events will be reinjected on entry path, if
accumulated. So they depend on any exit. If you disable HLT-exiting,

Another issue is you ignore the host's idea of the best way to sleep
(ACPI, or whatever).

And handling inactive HLT state (which was never enabled) can be painful.

--

From: Anthony Liguori
Date: Thursday, December 2, 2010 - 1:51 pm

But you can't inject an exception into a guest while the VMCS is active, 
can you?  So the guest takes an exit while in the hlt instruction but 
that's no different than if the guest has been interrupted because of 

So this works today because on a hlt exit, emulate_halt() will clear 
the HLT state, which then puts the vcpu into a state where it can 
receive an exception injection?

Regards,


--

From: Avi Kivity
Date: Friday, December 3, 2010 - 2:36 am

hlt exiting doesn't leave vcpu in the halted state (since hlt has not 

The halt state is never entered.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Gleb Natapov
Date: Friday, December 3, 2010 - 5:40 am

Async PF completion does not kick the vcpu out of guest mode. It wakes
the vcpu only if it is waiting on a waitqueue. It was done to not generate

--
			Gleb.
--

From: Chris Wright
Date: Thursday, December 2, 2010 - 12:14 pm

Perhaps it should be a VM level option.  And then invert the notion.
Create one idle domain w/out hlt trap.  Give that VM a vcpu per pcpu
(pin in place probably).  And have that VM do nothing other than hlt.
Then it's always runnable according to scheduler, and can "consume" the
extra work that CFS wants to give away.

What do you think?

thanks,
-chris
--

From: Anthony Liguori
Date: Thursday, December 2, 2010 - 1:25 pm

That's an interesting idea.  I think Vatsa had some ideas about how to 
do this with existing mechanisms.

I'm interested in comparing behavior with fixed allocation because one 
thing the above relies upon is that the filler VM loses its time when 
one of the non-filler VCPUs needs to run.  This may all work correctly 
but I think it's easier to reason about having each non-filler VCPU 
have a fixed (long) time slice.  If a VCPU needs to wake up to become 
non-idle, it can do so immediately because it already has the PCPU.

Regards,


--

From: Chris Wright
Date: Thursday, December 2, 2010 - 1:40 pm

The flipside...don't have to worry about the issues that Marcelo brought
up.

Should be pretty easy to compare though.

thanks,
-chris
--

From: Marcelo Tosatti
Date: Thursday, December 2, 2010 - 1:40 pm

Consuming the timeslice outside guest mode is less intrusive and easier
to replace. Something like this should work?

if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
    while (!need_resched())
        default_idle();
} 

But you agree this is no KVM business.

--

From: Chris Wright
Date: Thursday, December 2, 2010 - 2:07 pm

Like non-trapping hlt, that too will guarantee that the guest is preempted
by timeslice exhaustion (and is simpler than non-trapping hlt).  So it
may well be the simplest for the case where we are perfectly committed
(i.e. the vcpu fractional core count totals the pcpu count).  But once
we are undercommitted we still need some extra logic to handle the hard
cap and something to kick the running guest off the cpu and suck up the
extra cycles in a power conserving way.

thanks,
-chris
--

From: Anthony Liguori
Date: Thursday, December 2, 2010 - 3:37 pm

I'm not entirely sure TBH.

If you think of a cloud's per-VCPU capacity in terms of Compute Units, 
having a model where a VCPU maps to 1-3 units depending on total load is 
potentially interesting particularly if the VCPU's capacity only changes 
in discrete amounts,  that the expected capacity is communicated to the 
guest, and that the capacity only changes periodically.

Regards,


--

From: Chris Wright
Date: Thursday, December 2, 2010 - 7:42 pm

OK, let's say a single PCPU == 12 Compute Units.

If the guest is the first to migrate to a newly added unused host, and
we are using either non-trapping hlt or Marcelo's non-yielding trapping
hlt, then that guest is going to get more CPU than it expected unless
there is some throttling mechanism.  Specifically, it will get 12CU
instead of 1-3CU.

Do you agree with that?

thanks,
-chris
--

From: Anthony Liguori
Date: Thursday, December 2, 2010 - 8:21 pm

Yes.

There's definitely a use-case to have a hard cap.

But I think another common use-case is really just performance 
isolation.  If over the course of a day, you go from 12CU, to 6CU, to 
4CU, that might not be that bad of a thing.

If the environment is designed correctly, of N nodes, N-1 will always be 
at capacity so it's really just a single node that is underutilized.

Regards,


--

From: Chris Wright
Date: Thursday, December 2, 2010 - 8:44 pm

OK, good, just wanted to be clear.  Because this started as a discussion
of hard caps, and it began to sound as if you were no longer advocating

I guess it depends on your SLA.  We don't have to do anything to give
varying CU based on host load.  That's the one thing CFS will do for

Many clouds do a variation on Small, Medium, Large sizing.  So depending
on the scheduler (best fit, rr...) even the notion of at capacity may
change from node to node and during the time of day.

thanks,
-chris
--

From: Anthony Liguori
Date: Friday, December 3, 2010 - 7:25 am

I'm really anticipating things like the EC2 micro instance where the CPU 
allotment is variable.  Variable allotments are interesting from a 
density perspective but having interdependent performance is definitely 
a problem.

Another way to think about it: a customer reports a performance problem 
at 1PM.  With non-yielding guests, you can look at logs and see that the 
expected capacity was 2CU (it may have changed to 4CU at 3PM).  However, 
without something like non-yielding guests, the performance is almost 
entirely unpredictable and unless you have an exact timestamp from the 
customer along with a fine granularity performance log, there's no way 

An ideal cloud will make sure that something like 4 Small == 2 Medium == 
1 Large instance and that the machine capacity is always a multiple of 
Large instance size.

With a division like this, you can always achieve maximum density 
provided that you can support live migration.

Regards,


--

From: Anthony Liguori
Date: Thursday, December 2, 2010 - 3:27 pm

My initial inclination is that this would be inappropriate for KVM but I 
think I'm slowly convincing myself otherwise.

Ultimately, hard limits and deterministic scheduling are related goals 
but not quite the same.  A hard limit is very specific: you want to 
receive no more than an exact amount of CPU time per VCPU.  With 
deterministic scheduling, you want to make sure that a set of VMs are 
not influenced by each other's behavior.

You want hard limits when you want to hide the density/capacity of a 
node from the end customer.  You want determinism when you simply want 
to isolate the performance of each customer from the other customers.

That is, the only thing that should affect the performance graph of a VM 
is how many neighbors it has (which is controlled by management 
software) rather than what its neighbors are doing.

If you have hard limits, you can approximate deterministic scheduling 
but it's complex in the face of changing numbers of guests.  
Additionally, hard limits present issues with directed yield that don't 
exist with a deterministic scheduling approach.

You can still donate your time slice to another VCPU because the VCPUs 
are not actually capped.  That may mean that an individual VCPU gets 
more PCPU time than an exact division but for the VM overall, it won't 
get more than its total share.  So the principle of performance 
isolation for the guest isn't impacted.

Regards,

Anthony Liguori


--

From: Avi Kivity
Date: Friday, December 3, 2010 - 2:40 am

What's the difference between this and the Linux idle threads?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Srivatsa Vaddagiri
Date: Friday, December 3, 2010 - 4:21 am

If we have 3 VMs and want to give them 25% each of a CPU, then having just the
idle thread would end up giving them 33%. One way of achieving a 25% rate limit
is to create a "dummy" or "filler" VM, and let it compete for resource, thus
rate-limiting everyone to 25% in this case. Essentially we are tackling the
rate-limit problem by creating additional "filler" VMs/threads that will compete
for resource, thus keeping in check how much cpu resource is consumed by "real"
VMs. Admittedly not as neat as having in-kernel support for rate-limit.
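
A small sketch of the filler count implied above, assuming all VMs have
equal scheduler weight, the target share divides 100 evenly, and every
real VM stays busy.  The helper name is hypothetical:

```c
#include <assert.h>

/*
 * Number of always-runnable "filler" VMs needed so that an equal-share
 * scheduler gives each of nr_vms real VMs exactly target_pct of one CPU.
 *
 * Hypothetical helper for illustration only.  The result holds only
 * while every real VM is itself runnable.
 */
int fillers_needed(int nr_vms, int target_pct)
{
	int slots;

	assert(target_pct > 0 && 100 % target_pct == 0);
	slots = 100 / target_pct;	/* total equal shares on the CPU */
	assert(nr_vms > 0 && nr_vms <= slots);

	return slots - nr_vms;
}
```

For 3 VMs at 25% each this yields one filler; for 2 VMs at 25% it yields
two.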

- vatsa

--

From: Srivatsa Vaddagiri
Date: Friday, December 3, 2010 - 4:57 am

That's not sufficient. Let's say we have 3 guests A, B, C that need to be rate
limited to 25% on a single cpu system. We create this idle guest D that is a 100%
cpu hog as per above definition. Now when one of the guests (say A) is idle,
what ensures that the idle cycles of A are given only to D and not partly to B/C?

- vatsa
--

From: Srivatsa Vaddagiri
Date: Friday, December 3, 2010 - 9:27 am

To tackle this problem, I was thinking of having a fill-thread associated with 
each vcpu (i.e both belong to same cgroup). Fill-thread consumes idle cycles 
left by vcpu, but otherwise doesn't compete with it for cycles.

- vatsa
--

From: Chris Wright
Date: Friday, December 3, 2010 - 10:29 am

That's what Marcelo's suggestion does w/out a fill thread.
--

From: Srivatsa Vaddagiri
Date: Friday, December 3, 2010 - 10:33 am

Are we willing to add that to KVM sources?

I was working under the constraint of not modifying the kernel (especially 
avoiding short-term hacks that become unnecessary in the longer run, in this
case when kernel-based hard limits go in).

- vatsa
--

From: Srivatsa Vaddagiri
Date: Friday, December 3, 2010 - 10:57 am

There's one complication though even with that. How do we compute the
real utilization of VM (given that it will appear to be burning 100% cycles)?
We need to have scheduler discount the cycles burnt post halt-exit, so more
stuff is needed than those simple 3-4 lines!

- vatsa
--

From: Chris Wright
Date: Friday, December 3, 2010 - 10:58 am

Heh, was just about to say the same thing ;)
--

From: Anthony Liguori
Date: Friday, December 3, 2010 - 11:07 am

My first reaction is that it's not terribly important to account the 
non-idle time in the guest because of the use-case for this model.

Eventually, it might be nice to have idle time accounting but I don't 
see it as a critical feature here.

Non-idle time simply isn't as meaningful here as it normally would be.  
If you have 10 VMs in a normal environment and saw that you had only 50% 
CPU utilization, you might be inclined to add more VMs.  But if you're 
offering deterministic execution, it doesn't matter if you only have 
"50%" utilization.  If you add another VM, the guests will get exactly 
the same impact as if they were using 100% utilization.

Regards,


--

From: Srivatsa Vaddagiri
Date: Friday, December 3, 2010 - 11:12 am

Agreed ...but I was considering the larger user-base who may be surprised to see
their VMs being reported as 100% hogs when they had left it idle.

- vatsa
--

From: Chris Wright
Date: Friday, December 3, 2010 - 11:20 am

Depends on the chargeback model.  This would put guest vcpu runtime vs
host running guest vcpu time really out of skew.  ('course w/out steal
and that time it's already out of skew).  But I think most models are

Who is "you"?  cloud user, or cloud service provider's scheduler?
On the user side, 50% cpu utilization wouldn't trigger me to add new
VMs.  On the host side, 50% cpu utilization would have to be measure

Sorry, didn't follow here?

thanks,
-chris
--

From: Anthony Liguori
Date: Friday, December 3, 2010 - 11:55 am

Right.  I'm not familiar with any models that are actually based on 
CPU-consumption based accounting.  In general, the feedback I've 
received is that predictable accounting is pretty critical so I don't 
anticipate something as volatile as CPU-consumption ever being something 

The question is, why would something care about host CPU utilization?  
The answer I can think of is, something wants to measure host CPU 
utilization to identify an underutilized node.  Once the underutilized 
node is identified, more work can be given to it.

Adding more work to an underutilized node doesn't change the amount of 
work that can be done.  More concretely, one PCPU, four independent 
VCPUs.  They are consuming, 25%, 25%, 25%, 12% respectively.  My 
management software says, ah hah, I can stick a fifth VCPU on this box 
that's only using 5%.  The other VCPUs are unaffected.

However, in a no-yield-on-hlt model, if I have four VCPUs, they each get 
25%, 25%, 25%, 25% on the host.  Three of the VCPUs are running 100% in 
the guest and one is running 50%.

If I add a fifth VCPU, even if it's only using 5%, each VCPU drops to 
20%.  That means the three VCPUs that are consuming 100% now see a 20% 
drop in their performance even though you've added an idle guest.
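
A rough sketch of the two cases above, assuming the work-conserving
scheduler behaves like an idealized max-min fair allocator and that a
no-yield-on-hlt VCPU can be modeled as always demanding 100%.  The
function name, MAX_VCPUS, and the model itself are assumptions made for
illustration:

```c
#include <assert.h>

#define MAX_VCPUS 16	/* arbitrary bound for this sketch */

/*
 * Max-min fair split of `capacity` (percent of one PCPU) among n VCPUs.
 * demand[i] is what VCPU i would use if unconstrained; alloc[i] receives
 * what it gets.  VCPUs that want less than an equal share keep their
 * demand, and the leftover is split equally among the rest.
 */
void max_min_fair(const double *demand, double *alloc, int n, double capacity)
{
	int done[MAX_VCPUS] = { 0 };
	int left = n;
	int i, progress;

	assert(n > 0 && n <= MAX_VCPUS);

	do {
		double fair = capacity / left;

		progress = 0;
		for (i = 0; i < n; i++) {
			if (!done[i] && demand[i] <= fair) {
				alloc[i] = demand[i];
				capacity -= demand[i];
				done[i] = 1;
				left--;
				progress = 1;
			}
		}
	} while (progress && left > 0);

	/* Whoever is still unsatisfied splits the remainder equally. */
	for (i = 0; i < n; i++)
		if (!done[i])
			alloc[i] = capacity / left;
}
```

With demands {25, 25, 25, 12, 5} every VCPU receives exactly its demand,
so the newcomer affects nobody.  With all five demanding 100, each
receives 20.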

Basically, the traditional view of density simply doesn't apply in this 
model.

Regards,


--

From: Marcelo Tosatti
Date: Friday, December 3, 2010 - 11:10 am

Probably yes. The point is, you get the same effect as with the
non-trapping hlt but without the complications on low-level VMX/SVM
code.

Even better if you can do it with fill thread idea.

--

From: Marcelo Tosatti
Date: Friday, December 3, 2010 - 11:24 am

Well, no. Better to consume hlt time but yield if need_resched or in 
case of any event which breaks out of kvm_vcpu_block.


--

From: Chris Wright
Date: Friday, December 3, 2010 - 10:28 am

Yeah, I pictured priorities handling this.
--

From: Srivatsa Vaddagiri
Date: Friday, December 3, 2010 - 10:36 am

All guests are of equal priority in this case (that's how we are able to divide 
time into 25% chunks), so unless we dynamically boost D's priority based on how
idle other VMs are, it's not going to be easy!

- vatsa
--

From: Chris Wright
Date: Friday, December 3, 2010 - 10:38 am

Right, I think there has to be an external mgmt entity.  Because num
vcpus is not static.  So priorities have to be rebalanced at vcpu
create/destroy time.

thanks,
-chris
--

From: Srivatsa Vaddagiri
Date: Friday, December 3, 2010 - 10:43 am

and at idle/non-idle time as well, which makes the mgmt entity's job rather
harder? Anyway, if we are willing to take a patch to burn cycles upon halt (as
per Marcelo's patch), that'd be the best (short-term) solution ..otherwise,
something like a filler-thread per-vcpu is easier than dynamic change of
priorities ..

- vatsa
--

From: Anthony Liguori
Date: Friday, December 3, 2010 - 10:47 am

We've actually done a fair amount of testing with using priorities like 
this.  The granularity is extremely poor because priorities don't map 
linearly to cpu time allotment.  The interaction with background tasks 
also gets extremely complicated.

Regards,


--
