[RFC -v3 PATCH 0/3] directed yield for Pause Loop Exiting

Previous thread: [RFC -v3 PATCH 3/3] Subject: kvm: use yield_to instead of sleep in kvm_vcpu_on_spin by Rik van Riel on Monday, January 3, 2011 - 2:30 pm. (1 message)

Next thread: Re: [PATCH] kvm: Fix section mismatch derived from kvm_guest_cpu_online() by H. Peter Anvin on Monday, January 3, 2011 - 2:36 pm. (2 messages)
From: Rik van Riel
Date: Monday, January 3, 2011 - 2:26 pm

When running SMP virtual machines, it is possible for one VCPU to be
spinning on a spinlock, while the VCPU that holds the spinlock is not
currently running, because the host scheduler preempted it to run
something else.

Both Intel and AMD CPUs have a feature that detects when a virtual
CPU is spinning on a lock and will trap to the host.

The current KVM code sleeps for a bit whenever that happens, which
results in eg. a 64 VCPU Windows guest taking forever and a bit to
boot up.  This is because the VCPU holding the lock is actually
running and not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with
spinlock detection resulting in one guest giving up its CPU time
to the others.  Instead of spinning, it ends up simply not running
much at all.

This patch series aims to fix that, by having a VCPU that spins
give the remainder of its timeslice to another VCPU in the same
guest before yielding the CPU - one that is runnable but got 
preempted, hopefully the lock holder.

v3:
- more cleanups
- change to Mike Galbraith's yield_to implementation
- yield to spinning VCPUs, this seems to work better in some
  situations and has little downside potential
v2:
- make lots of cleanups and improvements suggested
- do not implement timeslice scheduling or fairness stuff
  yet, since it is not entirely clear how to do that right
  (suggestions welcome)
--

From: Rik van Riel
Date: Monday, January 3, 2011 - 2:29 pm

From: Mike Galbraith <efault@gmx.de>

Add a yield_to function to the scheduler code, allowing us to
give enough of our timeslice to another thread to allow it to
run and release whatever resource we need it to release.

We may want to use this to provide a sys_yield_to system call
one day.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Not-signed-off-by: Mike Galbraith <efault@gmx.de>

--- 
Mike, want to change the above into a Signed-off-by: ? :)
This code seems to work well.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5f926c..0b8a3e6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1083,6 +1083,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
 	void (*yield_task) (struct rq *rq);
+	int (*yield_to_task) (struct task_struct *p, int preempt);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
@@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
 # define rt_mutex_adjust_pi(p)		do { } while (0)
 #endif
 
+extern void yield_to(struct task_struct *p, int preempt);
 extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index f8e5a25..ffa7a9d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6901,6 +6901,53 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+	struct task_struct *curr = ...
From: Mike Galbraith
Date: Monday, January 3, 2011 - 6:51 pm

Done.

	-Mike

--

From: KOSAKI Motohiro
Date: Monday, January 3, 2011 - 11:14 pm

At least I want. Ruby has GIL(giant interpreter lock). And giant lock
naturally enforce an app to implement cooperative multithreading model.
Therefore it has similar problem with your one. Python solved this issue
by slowing lock mechanism (two pthread-cond wakeup each GIL releasing), 
but I don't want it.

Also, If pthread_cond_signal() call sys_yield_to imlicitly, we can
avoid almost Nehalem (and other P2P cache arch) lock unfairness 
problem. (probaby creating pthread_condattr_setautoyield_np or similar
knob is good one)



--

From: Avi Kivity
Date: Tuesday, January 4, 2011 - 5:03 am

Often, the thread calling pthread_cond_signal() wants to continue 
executing, not yield.

-- 
error compiling committee.c: too many arguments to function

--

From: Peter Zijlstra
Date: Tuesday, January 4, 2011 - 10:16 am

NAK NAK NAK, yield_to is utter crap, and the only reason kvm 'needs' it
is because its wants to be utter crap (run unmodified guests).

There is plenty of sane serialization primitives for userspace, fix your
locking mess instead of pushing crap.

The only reason I'm maybe half-way considering this is because its a
pure in-kernel interface which we can 'fix' once unmodified guests
aren't important anymore.


--

From: Hillf Danton
Date: Tuesday, January 4, 2011 - 7:28 am

If not pulled and this_rq() != task_rq(p), only assigning ->next could
kick p onto its CPU?

If not, how is the lock contention eased then?

A few words to explain please.

thanks
--

From: Hillf Danton
Date: Tuesday, January 4, 2011 - 9:41 am

/*
         * ask scheduler to compute the next for successfully kicking
@p onto its CPU
         * what if p_rq is rt_class to do?
         */
	next = pick_next_task(p_rq);
	if (next != p)
		p->se.vruntime = next->se.vruntime -1;
	deactivate_task(p_rq, p, 0);
	activate_task(p_rq, p, 0);
	if (rq == p_rq)
		schedule();
	else
		resched_task(p_rq->curr);
	yield = 0;

--

From: Rik van Riel
Date: Tuesday, January 4, 2011 - 9:44 am

Wouldn't that break for FIFO and RR tasks?

There's a reason all the scheduler folks wanted a
per-class yield_to_task function :)

-- 
All rights reversed
--

From: Hillf Danton
Date: Tuesday, January 4, 2011 - 9:51 am

Where is the yield_to callback in the patch for RT schedule class?
If @p is RT, what could you do?

Hillf
From: Rik van Riel
Date: Tuesday, January 4, 2011 - 9:54 am

If the user chooses to overcommit the CPU with realtime
tasks, the user cannot expect realtime response.

For realtime, I have not implemented the yield_to callback
at all because it would probably break realtime semantics
and I assume people will not overcommit the CPU with realtime
tasks anyway.

I could see running a few realtime guests on a system, with
the number of realtime VCPUs not exceeding the number of
physical CPUs.

-- 
All rights reversed
--

From: Hillf Danton
Date: Tuesday, January 4, 2011 - 10:02 am

Then it looks curr->sched_class != p->sched_class is not enough,
and yield_to can not ease the lock contention in KVM in case where
p->rq->curr is RT.
--

From: Peter Zijlstra
Date: Tuesday, January 4, 2011 - 10:08 am

RT guests are a pipe dream, you first need to get the hypervisor (kvm in
this case) to be RT, which it isn't. Then you either need to very
statically set-up the host and the guest scheduling constraints (not
possible with RR/FIFO) or have a complete paravirt RT scheduler which
communicates its requirements to the host.


--

From: Hillf Danton
Date: Tuesday, January 4, 2011 - 10:12 am

Even guest is not RT, you could not prevent it from being preempted by
RT task which has nothing to do guests.
--

From: Peter Zijlstra
Date: Tuesday, January 4, 2011 - 10:22 am

Sure, but yield_to() being a NOP for RT tasks is perfectly fine. Pretty
much all yield*() usage is a bug anyway.
--

From: Rik van Riel
Date: Tuesday, January 4, 2011 - 10:53 am

There's a limited use case.

One host can have a few RT guests, say a host with 8 CPUs
can have up to 6 or 7 RT VCPUs.  Those guests get top
priority.

The host can then have some other, low priority, guests
that scavenge remaining CPU time.

In this case, no yield_to is required for RT guests,
because they do not do overcommit.

-- 
All rights reversed
--

From: Peter Zijlstra
Date: Tuesday, January 4, 2011 - 11:05 am

RT guests don't make sense, there's nowhere near enough infrastructure
for that to work well.

I'd argue that KVM running with RT priority is a bug.
--

From: Peter Zijlstra
Date: Tuesday, January 4, 2011 - 11:04 am

This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible
I'd make it so only kvm.o could use it. It really sucks that kvm is a

I'd simply bail if its not the same cgroup, who cares about that case

The calling site already checks for same_thread_group(), we never even

This only works by the grace that the caller checked p->se.on_rq. A

I don't get this.. Why would you resched the remote cpu, surely you


--

From: Mike Galbraith
Date: Tuesday, January 4, 2011 - 11:53 am

too I suppose.

Target doesn't live here, preempt is still set/allowed, so we want the
remote cpu to schedule.

	-Mike

--

From: Mike Galbraith
Date: Monday, January 3, 2011 - 11:42 pm

A couple questions.


Do you have any numbers?

If I were to, say, run a 256 CPU VM on my quad, would this help me get

Does an Intel Q6600 have this trap gizmo (iow will this do anything at
all for my little box if I were to try it out).

	-Mike

--

From: Avi Kivity
Date: Tuesday, January 4, 2011 - 2:09 am

First of all, you can't run 256 guests on x86 kvm.  Second, you'll never 
see better performance when you overcommit.  What this patchset does is 
reduce the degradation from utterly ridiculous to something manageable.  
This allows a host to deliver reasonable performance when overcommitted 
vcpus are actually used, but it's not a good idea to run 64 vcpus on a 

Likely not.  Run 
http://git.kernel.org/?p=virt/kvm/qemu-kvm.git;a=blob_plain;f=kvm/scripts/vmxcap;hb=HEAD, 
look for 'PAUSE-loop exiting'.  I think the first processors to include 
them were the Nehalem-EXs, and Westmeres have them as well.

-- 
error compiling committee.c: too many arguments to function

--

From: Mike Galbraith
Date: Tuesday, January 4, 2011 - 3:32 am

Oh darn, need new box to get all the toys.  Thanks.

	-Mike


--

From: Avi Kivity
Date: Tuesday, January 4, 2011 - 3:35 am

I meant a 256 vcpu guest.  Certainly you can run 256 guests, you're 

I have a patchset somewhere that emulates pause-loop exiting by looking 
at rip during timer interrupts.  Unfortunately timer interrupts are very 
rare compared to pause-loop exits, so it's not very helpful.

-- 
error compiling committee.c: too many arguments to function

--

From: Avi Kivity
Date: Tuesday, January 4, 2011 - 2:14 am

Looks great.

Assuming there are no objections, Mike, can you get 2/3 into a 
fast-forward-only branch of tip.git?  I'll then merge it into kvm.git 
and apply 1/3 and 2/3.

Since the merge window is about to open I'd like to merge this for 
2.6.39 so it gets to stew a while in tip.git, kvm.git, and linux-next.

-- 
error compiling committee.c: too many arguments to function

--

From: Mike Galbraith
Date: Tuesday, January 4, 2011 - 3:26 am

Wrong guy.  Fast-forward is Peter's department.

	-Mike

--

From: Peter Zijlstra
Date: Tuesday, January 4, 2011 - 10:20 am

Right, let me have a look, I put this thread on pause for a while, need
to catch up.
--

Previous thread: [RFC -v3 PATCH 3/3] Subject: kvm: use yield_to instead of sleep in kvm_vcpu_on_spin by Rik van Riel on Monday, January 3, 2011 - 2:30 pm. (1 message)

Next thread: Re: [PATCH] kvm: Fix section mismatch derived from kvm_guest_cpu_online() by H. Peter Anvin on Monday, January 3, 2011 - 2:36 pm. (2 messages)