Add a yield_to function to the scheduler code, allowing us to
give the remainder of our timeslice to another thread.
We may want to use this to provide a sys_yield_to system call
one day.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
- move to a per sched class yield_to
- fix the locking
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c79e92..408326f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1086,6 +1086,8 @@ struct sched_class {
#ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_move_group) (struct task_struct *p, int on_rq);
#endif
+
+ void (*yield_to) (struct rq *rq, struct task_struct *p);
};
struct load_weight {
@@ -1947,6 +1949,7 @@ extern void set_user_nice(struct task_struct *p, long nice);
extern int task_prio(const struct task_struct *p);
extern int task_nice(const struct task_struct *p);
extern int can_nice(const struct task_struct *p, const int nice);
+extern void requeue_task(struct rq *rq, struct task_struct *p);
extern int task_curr(const struct task_struct *p);
extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
@@ -2020,6 +2023,10 @@ extern int wake_up_state(struct task_struct *tsk, unsigned int state);
extern int wake_up_process(struct task_struct *tsk);
extern void wake_up_new_task(struct task_struct *tsk,
unsigned long clone_flags);
+
+extern u64 slice_remain(struct task_struct *);
+extern void yield_to(struct task_struct *);
+
#ifdef CONFIG_SMP
extern void kick_process(struct task_struct *tsk);
#else
diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..6399641 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5166,6 +5166,46 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
return ret;
}
+/*
+ * Yield the CPU, giving the remainder of our time slice to task p.
+ * Typically used to hand CPU time to another thread inside the ...That part looks ok, except for the yield cross cpu bit. Trying to yield <ramble> slice_remain() measures the distance to your last preemption, which has no relationship with entitlement. sched_slice() is not used to issue entitlement, it's only a ruler. You have entitlement on your current runqueue only, that entitlement being the instantaneous distance to min_vruntime in a closed and fluid system. You can't inject some instantaneous relationship from one closed system into an another without making the math go kind of fuzzy, so you need tight constraints on how fuzzy it can get. We do that with migrations, inject fuzz. There is no global fair-stick, but we invent one by injecting little bits of fuzz. It's constrained by chaos and the magnitude constraints of the common engine. The more you migrate, the more tightly you couple systems. As long as we stay fairly well balanced, we can migrate without the fuzz getting out of hand, and end up with a globally ~fair system. What you're injecting isn't instantaneously irrelevant lag-fuzz, which distributed over time becomes relevant. you're inventing entitlement out of the void. Likely not a big hairy deal unless you do it frequently, but you're doing something completely bogus and seemingly unconstrained. This has an excellent chance of moving the recipient rightward.. and the yielding task didn't yield anything. This may achieve the desired result or may just create a nasty latency spike... but it makes no arithmetic sense. -Mike --
So another (crazy) idea is to move the "yieldee" task on another cpu over to yielding task's cpu, let it run till the end of yielding tasks slice and then let it go back to the original cpu at the same vruntime position! - vatsa --
Yeah, pulling the intended recipient makes fine sense. If he doesn't preempt you, you can try to swap vruntimes or whatever makes arithmetic sense and will help. Dunno how you tell him how long he can keep the cpu though, and him somehow going back home needs to be a plain old migration, no fancy restoration of ancient history vruntime. -Mike --
can't we adjust the new task's [prev_]sum_exec_runtime a bit so that it is What is the issue if it gets queued at the old vruntime (assuming fair stick is still behind that)? Without that it will hurt fairness for the yieldee (and perhaps of the overall VM in this case). - vatsa --
And dork up accounting. Why? Besides, it won't work because you have no idea who may preempt whom, when, and for how long. (Why do people keep talking about timeslice? The only thing that exists Who all are you placing this task in front of or behind based upon a non-existent relationship? Your recipient may well have been preempted, and is now further behind than the completely irrelevant to the current situation stored vruntime would indicate, so why would you want to move it rightward? Certainly not in the interest of fairness. -Mike --
The current task just donated the rest of its timeslice. Surely that makes it a reasonable idea to call yield, and get one of the other tasks on the current CPU running for Good point, the current task calls yield() in the function that calls yield_to_fair, but I seem to have lost the code that penalizes the current task's runtime... I'll reinstate that. -- All rights reversed --
There's nothing wrong with trying to give up the cpu. It's the concept If you want to yield_to(task), task needs to move to the resource you're sitting on. That's the only thing that makes any sense to me. Pull the task, maybe do some vruntime twiddling (likely very bad idea) to improve success chances, and schedule. Dunno how effective that would be at solving real problems though. You might get him to run, might not, you'll be fighting load balancing, and See comment in parentheses above :) -Mike --
BTW, with this vruntime donation thingy, what prevents a task from forking off accomplices who do nothing but wait for a wakeup and yield_to(exploit)? Even swapping vruntimes in the same cfs_rq is dangerous as hell, because one party is going backward. -Mike --
What's the difference between that and forking off accomplices who run(exploit) directly? Many threads dominating the scheduler has been solved by group scheduling. We need to make sure directed yield doesn't violate that, but I don't see new problems. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
Let it become possible to stop clocks, and you surely will :) -Mike --
That's scheduler jargon, I'm not familiar with scheduler internals. Let's talk about this at a high level, since whenever I describe it in scheduler terms I'm likely to get it wrong. If there are N tasks on the machine, each is entitled to 1/N of the machine's cpu resources (ignoring nice and cgroups for the moment). What I'd like is for one task to temporarily pass a portion of its entitlement to another. No other task's entitlement is affected; they still get their 1/N. If a task donates its entitlement and immediately commits suicide, still we won't have fundamentally changed anything; the task could have kept itself alive and consumed that entitlement, so the scheduler was already prepared to give it this entitlement so nothing was gained by the forking task. The problem of fork() creating new entitlements out of thin air has nothing to do with directed yield, and is solved by cgroups. We already do something similar with priority inheritance. This doesn't involve cpu entitlement, but we have the same situation where a task's ability to make progress depends on another task receiving cpu time. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
I just realized the answer to this question. We only give cpu time to tasks that are runnable, but not currently running. That ensures one task cannot block others from running by having time yielded to it constantly. --
Hm. Don't think that will 100% sure prevent clock stoppage, because the
running task doesn't necessarily advance min_vruntime.
What about something like the below instead? It compiles, but may eat
your first born child.
sched: implement fair class yield_to(task) using cfs_rq->next as a selection hint.
<CHANGELOG>
Not-signed-off-by: Mike Galbraith <efault@gmx.de>
---
include/linux/sched.h | 1
kernel/sched.c | 47 +++++++++++++++++++++++++++++++++++++++++
kernel/sched_fair.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 104 insertions(+)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1056,6 +1056,7 @@ struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
+ int (*yield_to_task) (struct task_struct *p, int preempt);
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5325,6 +5325,53 @@ void __sched yield(void)
}
EXPORT_SYMBOL(yield);
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+ struct task_struct *curr = current;
+ struct rq *rq, *p_rq;
+ unsigned long flags;
+ int yield = 0;
+
+ local_irq_save(flags);
+ rq = this_rq();
+
+again:
+ p_rq = task_rq(p);
+ double_rq_lock(rq, p_rq);
+ while (task_rq(p) != p_rq) ...Mike, I'd be glad to try it, but the last kernel I was able to make the nvidia drivers install on was 2.6.36.1-latencyfix, which had your earier -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) I have more humility in my little finger than you have in your whole ____BODY! -- from "Cerebus" #82 --
It wouldn't do diddly spit by itself anyway, needs someone to stick yield_to(amigo) in interesting spots. Might not do diddly spit even then :) -Mike --
Oh, in that case it will eventually filter down to me anyway after the rest of this list beats on it to see what falls out. But I occasionally like to be part of the filtering media as long as I don't bleed too much. ;-) Thanks Mike. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Without facts, the decision cannot be made logically. You must rely on your human intuition. -- Spock, "Assignment: Earth", stardate unknown --
It's christmas. I've replaced my original patch with your patch in my series, just haven't gotten around to running Avi's latest test on it yet :) -- All rights reversed --
What's so strange about it? From a high level there are N runnable tasks contending for M cpus. If task X really needs task Y to run, what does it matter if task Y last ran on the same cpu as task X or not? Do I correctly read between the lines that CFS maintains complete fairness only on a cpu, but not globally? I hope I'm wrong, but PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND 8648 avi 20 0 106m 1092 148 R 80.4 0.0 0:26.03 1 bash 8656 avi 20 0 106m 1080 136 R 47.3 0.0 0:14.73 0 bash 8652 avi 20 0 106m 1080 136 R 47.0 0.0 0:15.36 0 bash doesn't look too good (three infinite loops in bash started at the roughly same time) -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
Task X wants control of when runnable task Y gets the cpu. Task X clearly wants to be the scheduler. This isn't about _yielding_ diddly spit, it's about individual tasks wanting to make scheduling decisions, so calling it a yield is high grade horse-pookey. You're trying to give the scheduler a hint, the stronger that hint, the happier you'll be. I can see the problem, and I'm not trying to be Mr. Negative here, I'm only trying to point out problems I see with what's been proposed. If the yielding task had a concrete fee he could pay, that would be fine, but he does not. If he did have something, how often do you think it should be possible for task X to bribe the scheduler into selecting task Y? Will his pockets be deep enough to actually solve the problem? Once he's yielded, he's out of the picture for a while if he really gave anything up. What happens to donated entitlement when the recipient goes to sleep? If you try to give it back, what happens if the donor exited? Where did the entitlement come from if task A running alone on cpu A tosses some entitlement over the fence to his pal task B on cpu B.. and keeps on trucking on cpu A? Where does that leave task C, B's Nothing between the lines about it. There are N individual engines, coupled via load balancing. -Mike --
It does. The yielding task is entitled to its fair share of the cpu, as modified by priority and group scheduling. The yielding task is willing to give up some of this cpu, in return for increasing another task's Unless the other task donates some cpu share back. This is exactly what Eventually C would replace A, since its share will be exhausted. If C Is this not seen as a major deficiency? I can understand intra-cpu scheduling decisions at 300 Hz and inter-cpu decisions at 10 Hz (or even lower, with some intermediate rate for intra-socket scheduling). But this looks like a major deviation from fairness - instead of 33%/33%/33% you get 50%/25%/25% depending on random placement. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
The donor does absolutely NOT have any claim to any specific quantum of any cpu at any given time. There is no reservation, only a running clock. If you let one task move another task's clock backward, you open Hm, so it needs to be very cheap, and highly repeatable. What if: so you're trying to get spinners out of the way right? You somehow know they're spinning, so instead of trying to boost some task, can you do a directed yield in terms of directing a spinner that you have the right to diddle to yield. Drop his lag, and resched him. He's not accomplishing anything anyway. If the only thing running is virtualization, and nobody else can use the interface being invented, all is fair, but this passing of vruntime around is problematic when innocent bystanders may want to play too. Doesn't seem to be. That's what SMP-nice was all about. It's not Yeah, you have to bounce tasks around regularly to make that work out. -Mike --
Of course not, but we aren't interested in a specific quantum of time. All we want is to boost task A, and to unboost the donor in order to maintain fairness. We don't want A to run now (though it would be There are a couple of problems with this approach: - current yield() is a no-op - even if it weren't, the process (containing the spinner and the lock-holder) would yield as a whole. If it yielded for exactly the time needed (until the lock holder releases the lock), it wouldn't matter, since the spinner isn't accomplishing anything, but we don't know what the exact time is. So we want to preserve our entitlement. With a pure yield implementation the process would get less than its fair share, even discounting spin time, which we'd be happy to donate to We definitely want to maintain fairness. Both with a dedicated virt What's the problem exactly? What's the difference, system-wide, with the donor continuing to run for that same entitlement? Other tasks see I guess random perturbations cause task migrations periodically and things balance out. But it seems wierd to have this devotion to That's fine as long as the bounce period is significantly larger than the cost of a migration. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
I don't get this part. How does the whole process yield if one thread And that's the hard part. If can drop lag, you may hurt yourself, but That makes it difficult to the point of impossible. You want a specific task to run NOW for good reasons, but any number of tasks may want the same godlike power for equally good reasons. You could create a force select which only godly tasks could use that didn't try to play games with vruntimes, just let the bugger run, and let him also eat the latency hit he'll pay for that extra bit of cpu IFF you didn't care about being able to mix loads. Or, you could just bump his nice level with an automated return to previous level on resched. What's that? :) No task may run until there are enough of you to fill It doesn't ignore it complete, it just doesn't try to do all the math continuously (danger Will Robinson: Peter has scary patches). Prodding it in the right general direction with migrations is cheaper. -Mike --
The process is the sum of its threads. If a thread yield loses 1 msec of runtime due to the yield, the process loses 1 msec due to the yield. If the lock is held for, say, 100 usec, it would be better for the process to spin rather than yield. With directed yield the process loses nothing by yielding to one of its We already have a "hurt only yourself" thing. We sleep for 100 usec We aren't happy to donate it to the rest of the system, since it will cause a guest with lots of internal contention to make very little I don't want it to run now. I want it to run before some other task. I don't care if N other tasks run before both. So no godlike powers Since task A is running now, clearly the scheduler thinks it deserves to run. What I want to do is take just enough of the "deserves" part to Why is that a consequence of global fairness? three tasks get 100% cpu Doesn't seem to work from my brief experiment. -- error compiling committee.c: too many arguments to function --
If behaviors are very similar, and tasks are not likely to try to exploit it (as described), you can likely swap lags without horrible consequences. ..I was just trying to say that "global fairness" is not well defined. Never mind. -Mike --
Ponders that... What if: we test that both tasks are in the same thread group, if so, use cfs_rq->next to pass the scheduler a HINT of what you would LIKE to happen. If the current task on that rq is also in your group, resched it, then IFF the task you would like to run isn't too far right, it'll be selected. If the current task is not one of yours, tough, you can set cfs_rq->next and hope it doesn't get overwritten, but you may not preempt a stranger. If you happen to be sharing an rq, cool, you accomplished your yield_to(). If not, there's no practical way (I can think of) to ensure that the target runs before you run again if you try to yield, but you did your best to try to get him to the cpu sooner, and in a manner that preserves fairness without dangerous vruntime diddling. Would that be good enough to stop (or seriously improve) cpu wastage? -Mike --
In my use case, both tasks are in the same thread group, so that works. The cross-cpu limitation is bothersome. Since there are many cpus in modern machines, particularly ones used for virt, the probability of the two tasks being on the same cpu is quite low. -- error compiling committee.c: too many arguments to function --
Because preempting a perfect stranger is not courteous, all tasks have What would you suggest? There is no global execution timeline, so if you want to definitely run after this task, you're stuck with moving to his timezone or moving him to yours. Well, you could sleep a while, but we know how productive sleeping is. -Mike --
I don't want to preempt anybody, simply make the task run before me. Further, this is a kernel internal API, so no need for these types of Since I'm running and the target isn't, it's clear the scheduler thinks the target had more cpu than I did [73]. That's why I want to donate cpu time. [73] at least it'd be clear if the scheduler were globally fair. As it is, I might be the only task running on my cpu, therefore in a cpu glut, while the other task shares the cpu with some other task and is I don't know. The whole idea of donating runtime was predicated on CFS being completely fair. Now I find that (a) it isn't (b) donating runtimes between tasks on different cpus is problematic. Moving tasks between cpus is expensive and sometimes prohibited by pinning. I'd like to avoid it if possible, but it's better than nothing. -- error compiling committee.c: too many arguments to function --
I thought you wanted to get the target to the cpu asap? You just can't Doesn't matter whether it's kernel or not afaikt. If virtualization has to coexist peacefully with other loads, it can't just say "my hints are That's not necessarily true, in fact, it's very often false. Last/next buddy will allow a task to run ahead of leftmost so we don't always True and true. However, would you _want_ the scheduler to hold runnable tasks hostage, and thus let CPU go to waste in the name of perfect Expensive in many ways, so let's try to not do that. So why do you need this other task to run before you do, even cross cpu? If he's a lock holder, getting him to the cpu will give him a chance to drop, no? Isn't that what you want to get done? Drop that lock so you or someone else can get something other than spinning done? -Mike --
You're right, of course. I'm fine with running in parallel. I'm fine with him running before or instead of me. I'm not fine with running What does that have to do with being in the same group or not? I want to maintain fairness (needed for pure virt workloads, one guest cannot dominate another), but I don't see how being in the same thread group is relevant. Again, I don't want more than one entitlement. I want to move part of Correct. I don't want the other task to run before me, I just don't want to run before it. -- error compiling committee.c: too many arguments to function --
My thought is that you can shred your own throughput, but not some other OK, so what I gather is that if you can preempt another of your own threads to get the target to cpu, that would be a good thing whether he's on the same cpu as yield_to() caller or not. If the target is sharing a cpu with you, that's even better. Correct? Would a kick/hint option be useful? -Mike --
Correct. I'm not interested in scheduling wrt other tasks, just my tasks. However, if I'm all alone on my cpu, and the other task is runnable but not running, behind some unrelated task, then I do need that task to be Depends on what it does... -- error compiling committee.c: too many arguments to function --
Let you decide whether you only want to drop a hint and leave it at that, or also attempt a preemption. -Mike --
Who is "you" in this? the scheduler? I'm fine with hints so long as they are usually acted upon (if it isn't, I'll go back to the guest, spin a bit more, and retry). -- error compiling committee.c: too many arguments to function --
I guess it works. We can start out without preemption, if we see we spin to much, preempt. -- error compiling committee.c: too many arguments to function --
That wants to be plain: schedule(), possibly conditional on having
You're calling this from
yield_to()->sched_class::yield_to()->yield_to_fair()->slice_remain(),
Like Mike said, the returned figure doesn't really mean anything, its
definitely not the remaining time of a slice. It might qualify for a
Here you assume @p lives on @rq, but you passed:
+ current->sched_class->yield_to(rq, p);
You cannot simply subtract wall-time from virtual time, see the usage of
Also, modifying the vruntime of one task without also modifying the
vruntime of the other task breaks stuff. You're injecting time into p
without taking time out of current.
Maybe something like:
static void yield_to_fair(struct rq *p_rq, struct task_struct *p)
{
struct rq *rq = this_rq();
struct sched_entity *se = &current->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);
struct sched_entity *pse = &p->se;
struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
/*
* Transfer wakeup_gran worth of time from current to @p,
* this should ensure current is no longer eligible to run.
*/
unsigned long wakeup_gran = ACCESS_ONCE(sysctl_sched_wakeup_granularity);
update_rq_clock(rq);
update_curr(cfs_rq);
if (pse != p_cfs_rq->curr) {
__dequeue_entity(p_cfs_rq, pse);
} else {
update_rq_clock(p_rq);
update_curr(p_cfs_rq);
}
se->vruntime += calc_delta_fair(wakeup_gran, se);
pse->vruntime -= calc_delta_fair(wakeup_gran, pse);
clear_buddies(cfs_rq, se);
if (pse != p_cfs_rq->curr) {
__enqueue_entity(p_cfs_rq, pse);
check_preempt_curr(prq, p, 0)
}
}
This isn't strictly correct for the group scheduling case though, that
wants a for_each_sched_entity() loop for both se and pse, but I'd have
to like actually think about that ;-)
A quick hack might be simply dis-allowing yield_to between different
groups, add something like the below to the above function:
#ifdef CONFIG_FAIR_GROUP_SCHED
if (cfs_rq->tg != p_cfs_rq->tg)
return;
#endif
--
On 12/14/2010 07:22 AM, Peter Zijlstra wrote: ... fixed all the obvious stuff. No idea what the hell I was thinking while doing that "cleanup" - probably too busy looking at the tests that I was running on a previous codebase :( For the next version of the patches, I have switched to your version of yield_to_fair :) -- All rights reversed --
