When running SMP virtual machines, it is possible for one VCPU to be spinning on a spinlock, while the VCPU that holds the spinlock is not currently running, because the host scheduler preempted it to run something else. Both Intel and AMD CPUs have a feature that detects when a virtual CPU is spinning on a lock and will trap to the host. The current KVM code sleeps for a bit whenever that happens, which results in eg. a 64 VCPU Windows guest taking forever and a bit to boot up. This is because the VCPU holding the lock is actually running and not sleeping, so the pause is counter-productive. In other workloads a pause can also be counter-productive, with spinlock detection resulting in one guest giving up its CPU time to the others. Instead of spinning, it ends up simply not running much at all. This patch series aims to fix that, by having a VCPU that spins give the remainder of its timeslice to another VCPU in the same guest before yielding the CPU - one that is runnable but got preempted, hopefully the lock holder. v3: - more cleanups - change to Mike Galbraith's yield_to implementation - yield to spinning VCPUs, this seems to work better in some situations and has little downside potential v2: - make lots of cleanups and improvements suggested - do not implement timeslice scheduling or fairness stuff yet, since it is not entirely clear how to do that right (suggestions welcome) --
From: Mike Galbraith <efault@gmx.de>
Add a yield_to function to the scheduler code, allowing us to
give enough of our timeslice to another thread to allow it to
run and release whatever resource we need it to release.
We may want to use this to provide a sys_yield_to system call
one day.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Not-signed-off-by: Mike Galbraith <efault@gmx.de>
---
Mike, want to change the above into a Signed-off-by: ? :)
This code seems to work well.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5f926c..0b8a3e6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1083,6 +1083,7 @@ struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
void (*yield_task) (struct rq *rq);
+ int (*yield_to_task) (struct task_struct *p, int preempt);
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
@@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
# define rt_mutex_adjust_pi(p) do { } while (0)
#endif
+extern void yield_to(struct task_struct *p, int preempt);
extern void set_user_nice(struct task_struct *p, long nice);
extern int task_prio(const struct task_struct *p);
extern int task_nice(const struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index f8e5a25..ffa7a9d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6901,6 +6901,53 @@ void __sched yield(void)
}
EXPORT_SYMBOL(yield);
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+ struct task_struct *curr = ...At least I want. Ruby has GIL(giant interpreter lock). And giant lock naturally enforce an app to implement cooperative multithreading model. Therefore it has similar problem with your one. Python solved this issue by slowing lock mechanism (two pthread-cond wakeup each GIL releasing), but I don't want it. Also, If pthread_cond_signal() call sys_yield_to imlicitly, we can avoid almost Nehalem (and other P2P cache arch) lock unfairness problem. (probaby creating pthread_condattr_setautoyield_np or similar knob is good one) --
Often, the thread calling pthread_cond_signal() wants to continue executing, not yield. -- error compiling committee.c: too many arguments to function --
NAK NAK NAK, yield_to is utter crap, and the only reason kvm 'needs' it is because its wants to be utter crap (run unmodified guests). There is plenty of sane serialization primitives for userspace, fix your locking mess instead of pushing crap. The only reason I'm maybe half-way considering this is because its a pure in-kernel interface which we can 'fix' once unmodified guests aren't important anymore. --
If not pulled and this_rq() != task_rq(p), only assigning ->next could kick p onto its CPU? If not, how is the lock contention eased then? A few words to explain please. thanks --
/*
* ask scheduler to compute the next for successfully kicking
@p onto its CPU
* what if p_rq is rt_class to do?
*/
next = pick_next_task(p_rq);
if (next != p)
p->se.vruntime = next->se.vruntime -1;
deactivate_task(p_rq, p, 0);
activate_task(p_rq, p, 0);
if (rq == p_rq)
schedule();
else
resched_task(p_rq->curr);
yield = 0;
--
Wouldn't that break for FIFO and RR tasks? There's a reason all the scheduler folks wanted a per-class yield_to_task function :) -- All rights reversed --
Where is the yield_to callback in the patch for RT schedule class? If @p is RT, what could you do? Hillf
If the user chooses to overcommit the CPU with realtime tasks, the user cannot expect realtime response. For realtime, I have not implemented the yield_to callback at all because it would probably break realtime semantics and I assume people will not overcommit the CPU with realtime tasks anyway. I could see running a few realtime guests on a system, with the number of realtime VCPUs not exceeding the number of physical CPUs. -- All rights reversed --
Then it looks curr->sched_class != p->sched_class is not enough, and yield_to can not ease the lock contention in KVM in case where p->rq->curr is RT. --
RT guests are a pipe dream, you first need to get the hypervisor (kvm in this case) to be RT, which it isn't. Then you either need to very statically set-up the host and the guest scheduling constraints (not possible with RR/FIFO) or have a complete paravirt RT scheduler which communicates its requirements to the host. --
Even guest is not RT, you could not prevent it from being preempted by RT task which has nothing to do guests. --
Sure, but yield_to() being a NOP for RT tasks is perfectly fine. Pretty much all yield*() usage is a bug anyway. --
There's a limited use case. One host can have a few RT guests, say a host with 8 CPUs can have up to 6 or 7 RT VCPUs. Those guests get top priority. The host can then have some other, low priority, guests that scavenge remaining CPU time. In this case, no yield_to is required for RT guests, because they do not do overcommit. -- All rights reversed --
RT guests don't make sense, there's nowhere near enough infrastructure for that to work well. I'd argue that KVM running with RT priority is a bug. --
This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible I'd make it so only kvm.o could use it. It really sucks that kvm is a I'd simply bail if its not the same cgroup, who cares about that case The calling site already checks for same_thread_group(), we never even This only works by the grace that the caller checked p->se.on_rq. A I don't get this.. Why would you resched the remote cpu, surely you --
too I suppose. Target doesn't live here, preempt is still set/allowed, so we want the remote cpu to schedule. -Mike --
A couple questions. Do you have any numbers? If I were to, say, run a 256 CPU VM on my quad, would this help me get Does an Intel Q6600 have this trap gizmo (iow will this do anything at all for my little box if I were to try it out). -Mike --
First of all, you can't run 256 guests on x86 kvm. Second, you'll never see better performance when you overcommit. What this patchset does is reduce the degradation from utterly ridiculous to something manageable. This allows a host to deliver reasonable performance when overcommitted vcpus are actually used, but it's not a good idea to run 64 vcpus on a Likely not. Run http://git.kernel.org/?p=virt/kvm/qemu-kvm.git;a=blob_plain;f=kvm/scripts/vmxcap;hb=HEAD, look for 'PAUSE-loop exiting'. I think the first processors to include them were the Nehalem-EXs, and Westmeres have them as well. -- error compiling committee.c: too many arguments to function --
Oh darn, need new box to get all the toys. Thanks. -Mike --
I meant a 256 vcpu guest. Certainly you can run 256 guests, you're I have a patchset somewhere that emulates pause-loop exiting by looking at rip during timer interrupts. Unfortunately timer interrupts are very rare compared to pause-loop exits, so it's not very helpful. -- error compiling committee.c: too many arguments to function --
Looks great. Assuming there are no objections, Mike, can you get 2/3 into a fast-forward-only branch of tip.git? I'll then merge it into kvm.git and apply 1/3 and 2/3. Since the merge window is about to open I'd like to merge this for 2.6.39 so it gets to stew a while in tip.git, kvm.git, and linux-next. -- error compiling committee.c: too many arguments to function --
Wrong guy. Fast-forward is Peter's department. -Mike --
Right, let me have a look, I put this thread on pause for a while, need to catch up. --
