"sched_yield() is not - and should not be - about 'recalculating the position in the scheduler queue' like you do now in CFS," Linus Torvalds stated in a discussion with Completely Fair Scheduler author Ingo Molnar, pointing to the man pages to back up his argument that sched_yield should instead move a thread to the end of its queue, adding, "quite frankly, the current CFS behaviour simply looks buggy. It should simply not move it to the 'right place' in the rbtree. It should move it *last*."
Ingo described how it worked with the pre-2.6.23 scheduler, "the O(1) implementation of yield() was pretty arbitrary: it did not move it last on the same priority level - it only did it within the active array. So expired tasks (such as CPU hogs) would come _after_ a yield()-ing task." He went on to compare this to the new process scheduler , "so the yield() implementation was so much tied to the data structures of the O(1) scheduler that it was impossible to fully emulate it in CFS. In CFS we dont have a per-nice-level rbtree, so we cannot move it dead last within the same priority group - but we can move it dead last in the whole tree. (then they'd be put even after nice +19 tasks.) People might complain about _that_." He also noted that this would change the behavior for some desktop applications that call sched_yield(), "there will be lots of regression reports about lost interactivity during load."
Linus replied, "the sched_yield() behaviour is actually very well-defined for RT tasks (now, whether it's a good interface to use or not is still open to debate, but at least it's a _defined_ interface ;), and I think we should at least try to approximate that behaviour for normal tasks, even if they aren't RT." He added:
"So I don't think a 'globally last' option is at all the best thing, but I think it's likely better than what CFS does now. And if we then have some agreement on what would be considered a further _improvement_, then that would be a good thing - maybe we're not perfect, but at least having a view of what we want to do is a good idea."
Linus then suggested, "the 'enqueue at the end' could easily be a statistical thing, not an absolute thing."
Ingo agreed this was a possible long term goal, "I thought a bit about the statistical approach, and it's good in principle, but it has an implementational problem/complication: if there are only yielding tasks in the system, then the 'queue rightwards in the tree, statistically' approach cycles through the key-space artificially fast. That can cause various problems." He then summarized:
"So right now there are only two viable options i think: either do the current weak thing, or do the rightmost thing. The statistical method might work too, but it needs more thought and more testing - I'm not sure we can get that ready for 2.6.23.
"So what we have as working code right now is the two extremes, and apps will really mostly prefer either the first (if they don't truly want to use yield but somehow it got into their code) or the second (if they want some massive delay). So while it does not have a good QoI, how about doing a compat_yield sysctl that allows the turning on of the 'queue rightmost' logic?"
Ultimately, the suggested sysctl was merged into the mainline kernel.
From: Chuck Ebbert Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] Date: Sep 18, 4:02 pm 2007 On 09/18/2007 06:46 PM, Ingo Molnar wrote: >>> We need a (tested) >>> solution for 2.6.23 and the CFS-devel patches are not for 2.6.23. I've >>> attached below the latest version of the -rc6 yield patch - the switch >>> is not dependent on SCHED_DEBUG anymore but always available. >>> >> Is this going to be merged? And will you be making the default == 1 or >> just leaving it at 0, which forces people who want the older behavior >> to modify the default? > > not at the moment - Antoine suggested that the workload is probably fine > and the patch against -rc6 would have no clear effect anyway so we have > nothing to merge right now. (Note that there's no "older behavior" > possible, unless we want to emulate all of the O(1) scheduler's > behavior.) But ... we could still merge something like that patch, but a > clearer testcase is needed. The JVM's i have access to work fine. I just got a bug report today: https://bugzilla.redhat.com/show_bug.cgi?id=295071 ================================================== Description of problem: The CFS scheduler does not seem to implement sched_yield correctly. If one program loops with a sched_yield and another program prints out timing information in a loop. You will see that if both are taskset to the same core that the timing stats will be twice as long as when they are on different cores. This problem was not in 2.6.21-1.3194 but showed up in 2.6.22.4-65 and continues in the newest released kernel 2.6.22.5-76. Version-Release number of selected component (if applicable): 2.6.22.4-65 through 2.6.22.5-76 How reproducible: Very Steps to Reproduce: compile task1 int main() { while (1) { sched_yield(); } return 0; } and compile task2 #include <stdio.h> #include <sys/time.h> int main() { while (1) { int i; struct timeval t0,t1; double usec; gettimeofday(&t0, 0); for (i = 0; i < 100000000; ++i) ; gettimeofday(&t1, 0); usec = (t1.tv_sec * 1e6 + t1.tv_usec) - (t0.tv_sec * 1e6 + t0.tv_usec); printf ("%8.0f\n", usec); } return 0; } Then run: "taskset -c 0 ./task1" "taskset -c 0 ./task2" You will see that both tasks use 50% of the CPU. Then kill task2 and run: "taskset -c 1 ./task2" Now task2 will run twice as fast verifying that it is not some anomaly with the way top calculates CPU usage with sched_yield. Actual results: Tasks with sched_yield do not yield like they are suppose to. Expected results: The sched_yield task's CPU usage should go to near 0% when another task is on the same CPU. -
From: Ingo Molnar Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] Date: Sep 19, 12:18 pm 2007 * Chuck Ebbert <cebbert@redhat.com> wrote: > I just got a bug report today: > > https://bugzilla.redhat.com/show_bug.cgi?id=295071 > > ================================================== > > Description of problem: > > The CFS scheduler does not seem to implement sched_yield correctly. If > one program loops with a sched_yield and another program prints out > timing information in a loop. You will see that if both are taskset to > the same core that the timing stats will be twice as long as when they > are on different cores. This problem was not in 2.6.21-1.3194 but > showed up in 2.6.22.4-65 and continues in the newest released kernel > 2.6.22.5-76. sched_yield is a very poorly defined interface as it does not tell the kernel anything about what kind of locking user-space does. When an app calls the syscall it basically tells the scheduler: "uhm, ok, i'm blocked and i want you to do something else now. I dont want to tell you how long this task wants to wait, and i dont want to tell you on what condition it should be unblocked. Just ... do some stuff and let me alone. See ya." That's not an intelligent interface upon which the scheduler can do anything particularly intelligent (in fact, it's a very stupid interface upon which the scheduler can only do stupid things), and hence schedulers tend to implement sched_yield() quite arbitrarily and in a very scheduler-internals dependent way - which internals change when the scheduler is changed materially. The correct way to tell the kernel that the task is blocked is to use futexes for example, or any kernel-based locking or wait object - there are myriads of APIs for these. (The only well-defined behavior of yield is for SCHED_FIFO/RR tasks - and that is fully preserved in CFS.) Regarding compatible behavior: it's not possible to fully emulate the O(1) scheduler's yield behavior, because the yield implementation depended on so many scheduler internals. We changed yield when we went from 2.4 to 2.6, and even in 2.6 we changed it a number of times. To avoid the reoccuring problems of applications mistakenly relying on sched_yield(), we now context-switch on yield very weakly for SCHED_OTHER tasks (SCHED_FIFO/RR behavior is unchanged) - this is the only sane behavior that will get apps to stop using sched_yield() - and that's the difference that the above testcase shows. (there's actual user-space code executed, instead of the frequent overscheduling.) My patch below adds a sysctl flag that triggers a context-switch when yield is called (how futile and undefined and broken that might be), but that would only be for legacy reasons. We could still add this patch, but i was reluctant to send it to Linus without having at least one application that relies on having it and that benefits from it (Antoine's test turned out to not rely on it - see the '[FIXED]' qualifier in the subject line). Linus, what do you think? I have no strong feelings, I think the patch cannot hurt (it does not change anything by default) - but we should not turn the workaround flag on by default. If you agree that we should do this, then please pull this single patch from the sched.git tree: git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git i've tested it with the flag off and on as well, and it works as expected. Ingo ------------------> Ingo Molnar (1): sched: yield workaround include/linux/sched.h | 2 + kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++----- kernel/sysctl.c | 19 +++++++++++++ 3 files changed, 88 insertions(+), 6 deletions(-) Ingo ------------------------------> Subject: sched: yield workaround From: Ingo Molnar <mingo@elte.hu> sched_yield() is fundamentally broken, and CFS has changed its behavior. Some apps that mistakenly rely on sched_yield() want "weak" sched yield (such as desktop apps) - while some apps want "strong" sched_yield() (for example some JDKs). There's no way for the scheduler to figure out which of the two variants the app really wants - because sched_yield() is all about hiding from the kernel the true structure of the user-space locking code. As a solution, provide a workaround, to introduce a more agressive sched_yield implementation: # default one: echo 0 > /proc/sys/kernel/sched_yield_bug_workaround # always queues the current task next to the next task: echo 1 > /proc/sys/kernel/sched_yield_bug_workaround # NOP: echo 2 > /proc/sys/kernel/sched_yield_bug_workaround in the future, the use of this sysctl might generate a deprecation warning, so that apps start moving away from their reliance on sched_yield(). Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 2 + kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++----- kernel/sysctl.c | 19 +++++++++++++ 3 files changed, 88 insertions(+), 6 deletions(-) Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -1402,10 +1402,12 @@ extern void sched_idle_next(void); extern unsigned int sysctl_sched_latency; extern unsigned int sysctl_sched_min_granularity; +extern unsigned int sysctl_sched_yield_granularity; extern unsigned int sysctl_sched_wakeup_granularity; extern unsigned int sysctl_sched_batch_wakeup_granularity; extern unsigned int sysctl_sched_stat_granularity; extern unsigned int sysctl_sched_runtime_limit; +extern unsigned int sysctl_sched_yield_bug_workaround; extern unsigned int sysctl_sched_child_runs_first; extern unsigned int sysctl_sched_features; Index: linux/kernel/sched_fair.c =================================================================== --- linux.orig/kernel/sched_fair.c +++ linux/kernel/sched_fair.c @@ -42,6 +42,16 @@ unsigned int sysctl_sched_latency __read */ unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL; +unsigned int sysctl_sched_yield_granularity __read_mostly = 10000000ULL; + +/* + * sys_sched_yield workaround switch. + * + * This option switches the yield implementation of the + * old scheduler back on. + */ +unsigned int __read_mostly sysctl_sched_yield_bug_workaround; + /* * SCHED_BATCH wake-up granularity. * (default: 25 msec, units: nanoseconds) @@ -901,15 +911,66 @@ static void dequeue_task_fair(struct rq */ static void yield_task_fair(struct rq *rq, struct task_struct *p) { - struct cfs_rq *cfs_rq = task_cfs_rq(p); + if (!sysctl_sched_yield_bug_workaround) { + struct cfs_rq *cfs_rq = task_cfs_rq(p); + __update_rq_clock(rq); + + /* + * Dequeue and enqueue the task to update its + * position within the tree: + */ + dequeue_entity(cfs_rq, &p->se, 0); + enqueue_entity(cfs_rq, &p->se, 0); + return; + } + + if (sysctl_sched_yield_bug_workaround == 1) { + struct cfs_rq *cfs_rq = task_cfs_rq(p); + struct rb_node *curr, *next, *first; + struct task_struct *p_next; + s64 yield_key; + + __update_rq_clock(rq); + curr = &p->se.run_node; + first = first_fair(cfs_rq); + /* + * Move this task to the second place in the tree: + */ + if (curr != first) + next = rb_next(curr); + else + next = first; + /* + * We were the last one already - nothing to do, return + * and reschedule: + */ + if (unlikely(!next)) + return; + + p_next = rb_entry(next, struct task_struct, se.run_node); + /* + * Minimally necessary key value to be the second in the tree: + */ + yield_key = p_next->se.fair_key + + (int)sysctl_sched_yield_granularity; + + dequeue_entity(cfs_rq, &p->se, 0); + + /* + * Only update the key if we need to move more backwards + * than the minimally necessary position to be the second: + */ + if (p->se.fair_key < yield_key) + p->se.fair_key = yield_key; + + __enqueue_entity(cfs_rq, &p->se); + return; + } - __update_rq_clock(rq); /* - * Dequeue and enqueue the task to update its - * position within the tree: + * Just reschedule, do nothing else: */ - dequeue_entity(cfs_rq, &p->se, 0); - enqueue_entity(cfs_rq, &p->se, 0); + resched_task(p); } /* Index: linux/kernel/sysctl.c =================================================================== --- linux.orig/kernel/sysctl.c +++ linux/kernel/sysctl.c @@ -244,6 +244,17 @@ static ctl_table kern_table[] = { }, { .ctl_name = CTL_UNNUMBERED, + .procname = "sched_yield_granularity_ns", + .data = &sysctl_sched_yield_granularity, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec_minmax, + .strategy = &sysctl_intvec, + .extra1 = &min_sched_granularity_ns, + .extra2 = &max_sched_granularity_ns, + }, + { + .ctl_name = CTL_UNNUMBERED, .procname = "sched_wakeup_granularity_ns", .data = &sysctl_sched_wakeup_granularity, .maxlen = sizeof(unsigned int), @@ -303,6 +314,14 @@ static ctl_table kern_table[] = { .proc_handler = &proc_dointvec, }, #endif + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_yield_bug_workaround", + .data = &sysctl_sched_yield_bug_workaround, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, #ifdef CONFIG_PROVE_LOCKING { .ctl_name = CTL_UNNUMBERED, -
From: Linus Torvalds Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] Date: Sep 19, 12:39 pm 2007 On Wed, 19 Sep 2007, Ingo Molnar wrote: > > Linus, what do you think? I have no strong feelings, I think the patch > cannot hurt (it does not change anything by default) - but we should not > turn the workaround flag on by default. I disagree. I think CFS made "sched_yield()" worse, and what you call "bug workaround" is likely the *better* behaviour. The fact is, sched_yield() is not - and should not be - about "recalculating the position in the scheduler queue" like you do now in CFS. It very much is about moving the thread *dead last* within its priority group. That's what it does for round-robin, and it's not about fairness, it's about - Opengroup: DESCRIPTION The sched_yield() function forces the running thread to relinquish the processor until it again becomes the head of its thread list. It takes no arguments. - Linux man-page: DESCRIPTION A process can relinquish the processor voluntarily without blocking by calling sched_yield. The process will then be moved to the end of the queue for its static priority and a new process gets to run. and quite frankly, the current CFS behaviour simply looks buggy. It should simply not move it to the "right place" in the rbtree. It should move it *last*. Linus -
From: Ingo Molnar Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] Date: Sep 19, 12:56 pm 2007 * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Linus, what do you think? I have no strong feelings, I think the > > patch cannot hurt (it does not change anything by default) - but we > > should not turn the workaround flag on by default. > > I disagree. I think CFS made "sched_yield()" worse, and what you call > "bug workaround" is likely the *better* behaviour. > > The fact is, sched_yield() is not - and should not be - about > "recalculating the position in the scheduler queue" like you do now in > CFS. > > It very much is about moving the thread *dead last* within its > priority group. [...] > and quite frankly, the current CFS behaviour simply looks buggy. It > should simply not move it to the "right place" in the rbtree. It > should move it *last*. ok, we can do that. the O(1) implementation of yield() was pretty arbitrary: it did not move it last on the same priority level - it only did it within the active array. So expired tasks (such as CPU hogs) would come _after_ a yield()-ing task. so the yield() implementation was so much tied to the data structures of the O(1) scheduler that it was impossible to fully emulate it in CFS. in CFS we dont have a per-nice-level rbtree, so we cannot move it dead last within the same priority group - but we can move it dead last in the whole tree. (then they'd be put even after nice +19 tasks.) People might complain about _that_. another practical problem is that this will break certain desktop apps that do calls to yield() [some firefox plugins do that, some 3D apps do that, etc.] but they dont expect to be moved 'very late' into the queue - they expect the O(1) scheduler's behavior of being delayed "a bit". (That's why i added the yield-granularity tunable.) we can make yield super-agressive, that is pretty much the only sane (because well-defined) thing to do (besides turning yield into a NOP), but there will be lots of regression reports about lost interactivity during load. sched_yield() is a mortally broken API. "fix the app" would be the answer, but still there will be lots of complaints. Ingo -
From: Linus Torvalds Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] Date: Sep 19, 1:28 pm 2007 On Wed, 19 Sep 2007, Ingo Molnar wrote: > > in CFS we dont have a per-nice-level rbtree, so we cannot move it dead > last within the same priority group - but we can move it dead last in > the whole tree. (then they'd be put even after nice +19 tasks.) People > might complain about _that_. Yeah, with reasonably good reason. The sched_yield() behaviour is actually very well-defined for RT tasks (now, whether it's a good interface to use or not is still open to debate, but at least it's a _defined_ interface ;), and I think we should at least try to approximate that behaviour for normal tasks, even if they aren't RT. And the closest you can get is basically to say something that is close to "sched_yield()" on a non-RT process still means that all other runnable tasks in that same nice-level will be scheduled before the task is scheduled again. and I think that would be the optimal approximation of the RT behaviour. Now, quite understandably we might not be able to actually get that *exact* behaviour (since we mix up different nice levels), but I think from a QoI standpoint we should really have that as a "target" behaviour. So I think it should be moved back in the RB-tree, but really preferably only back to behind all other processes at the same nice-level. So I think we have two problems with yield(): - it really doesn't have very nice/good semantics in the first place for anything but strict round-robin RT tasks. We can't do much about this problem, apart from trying to make people use proper locking and other models that do *not* depend on yield(). - Linux has made the problem a bit worse by then having fairly arbitrary and different semantics over time. This is where I think it would be a good idea to try to have the above kind of "this is the ideal behaviour - we don't guarantee it being precise, but at least we'd have some definition of what people should be able to expect is the "best" behaviour. So I don't think a "globally last" option is at all the best thing, but I think it's likely better than what CFS does now. And if we then have some agreement on what would be considered a further _improvement_, then that would be a good thing - maybe we're not perfect, but at least having a view of what we want to do is a good idea. Btw, the "enqueue at the end" could easily be a statistical thing, not an absolute thing. So it's possible that we could perhaps implement the CFS "yield()" using the same logic as we have now, except *not* calling the "update_stats()" stuff: __dequeue_entity(..); __enqueue_entity(..); and then just force the "fair_key" of the to something that *statistically* means that it should be at the end of its nice queue. I dunno. Linus -
From: Ingo Molnar Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] Date: Sep 19, 2:41 pm 2007 * Linus Torvalds <torvalds@linux-foundation.org> wrote: > Btw, the "enqueue at the end" could easily be a statistical thing, not > an absolute thing. So it's possible that we could perhaps implement > the CFS "yield()" using the same logic as we have now, except *not* > calling the "update_stats()" stuff: > > __dequeue_entity(..); > __enqueue_entity(..); > > and then just force the "fair_key" of the to something that > *statistically* means that it should be at the end of its nice queue. > > I dunno. i thought a bit about the statistical approach, and it's good in principle, but it has an implementational problem/complication: if there are only yielding tasks in the system, then the "queue rightwards in the tree, statistically" approach cycles through the key-space artificially fast. That can cause various problems. (this means that the workload-flag patch that uses yield_granularity is buggy as well. The queue-rightmost patch did not have this problem.) So right now there are only two viable options i think: either do the current weak thing, or do the rightmost thing. The statistical method might work too, but it needs more thought and more testing - i'm not sure we can get that ready for 2.6.23. So what we have as working code right now is the two extremes, and apps will really mostly prefer either the first (if they dont truly want to use yield but somehow it got into their code) or the second (if they want some massive delay). So while it does not have a good QoI, how about doing a compat_yield sysctl that allows the turning on of the "queue rightmost" logic? Find tested patch below. Peter, what do you think? Linus, if this would be acceptable for .23 then i'll push it out into sched.git together with another fix that Hiroshi Shimamoto just posted to lkml. Ingo --------------------------------------> Subject: sched: add /proc/sys/kernel/sched_compat_yield From: Ingo Molnar <mingo@elte.hu> add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield() more agressive, by moving the yielding task to the last position in the rbtree. with sched_compat_yield=0: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield 2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop with sched_compat_yield=1: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 loop 2582 mingo 20 0 1576 256 204 R 0 0.0 0:00.00 loop_yield Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 1 kernel/sched.c | 5 --- kernel/sched_fair.c | 63 +++++++++++++++++++++++++++++++++++++++++++++----- kernel/sysctl.c | 8 ++++++ 4 files changed, 67 insertions(+), 10 deletions(-) Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -1406,6 +1406,7 @@ extern unsigned int sysctl_sched_wakeup_ extern unsigned int sysctl_sched_batch_wakeup_granularity; extern unsigned int sysctl_sched_stat_granularity; extern unsigned int sysctl_sched_runtime_limit; +extern unsigned int sysctl_sched_compat_yield; extern unsigned int sysctl_sched_child_runs_first; extern unsigned int sysctl_sched_features; Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -4550,10 +4550,7 @@ asmlinkage long sys_sched_yield(void) struct rq *rq = this_rq_lock(); schedstat_inc(rq, yld_cnt); - if (unlikely(rq->nr_running == 1)) - schedstat_inc(rq, yld_act_empty); - else - current->sched_class->yield_task(rq, current); + current->sched_class->yield_task(rq, current); /* * Since we are going to call schedule() anyway, there's Index: linux/kernel/sched_fair.c =================================================================== --- linux.orig/kernel/sched_fair.c +++ linux/kernel/sched_fair.c @@ -43,6 +43,14 @@ unsigned int sysctl_sched_latency __read unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL; /* + * sys_sched_yield() compat mode + * + * This option switches the agressive yield implementation of the + * old scheduler back on. + */ +unsigned int __read_mostly sysctl_sched_compat_yield; + +/* * SCHED_BATCH wake-up granularity. * (default: 25 msec, units: nanoseconds) * @@ -897,19 +905,62 @@ static void dequeue_task_fair(struct rq } /* - * sched_yield() support is very simple - we dequeue and enqueue + * sched_yield() support is very simple - we dequeue and enqueue. + * + * If compat_yield is turned on then we requeue to the end of the tree. */ static void yield_task_fair(struct rq *rq, struct task_struct *p) { struct cfs_rq *cfs_rq = task_cfs_rq(p); + struct rb_node **link = &cfs_rq->tasks_timeline.rb_node; + struct sched_entity *rightmost, *se = &p->se; + struct rb_node *parent; - __update_rq_clock(rq); /* - * Dequeue and enqueue the task to update its - * position within the tree: + * Are we the only task in the tree? + */ + if (unlikely(cfs_rq->nr_running == 1)) + return; + + if (likely(!sysctl_sched_compat_yield)) { + __update_rq_clock(rq); + /* + * Dequeue and enqueue the task to update its + * position within the tree: + */ + dequeue_entity(cfs_rq, &p->se, 0); + enqueue_entity(cfs_rq, &p->se, 0); + + return; + } + /* + * Find the rightmost entry in the rbtree: */ - dequeue_entity(cfs_rq, &p->se, 0); - enqueue_entity(cfs_rq, &p->se, 0); + do { + parent = *link; + link = &parent->rb_right; + } while (*link); + + rightmost = rb_entry(parent, struct sched_entity, run_node); + /* + * Already in the rightmost position? + */ + if (unlikely(rightmost == se)) + return; + + /* + * Minimally necessary key value to be last in the tree: + */ + se->fair_key = rightmost->fair_key + 1; + + if (cfs_rq->rb_leftmost == &se->run_node) + cfs_rq->rb_leftmost = rb_next(&se->run_node); + /* + * Relink the task to the rightmost position: + */ + rb_erase(&se->run_node, &cfs_rq->tasks_timeline); + rb_link_node(&se->run_node, parent, link); + rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline); } /* Index: linux/kernel/sysctl.c =================================================================== --- linux.orig/kernel/sysctl.c +++ linux/kernel/sysctl.c @@ -303,6 +303,14 @@ static ctl_table kern_table[] = { .proc_handler = &proc_dointvec, }, #endif + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_compat_yield", + .data = &sysctl_sched_compat_yield, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, #ifdef CONFIG_PROVE_LOCKING { .ctl_name = CTL_UNNUMBERED, -
From: Peter Zijlstra Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] Date: Sep 19, 2:58 pm 2007 On Wed, 19 Sep 2007 23:41:05 +0200 Ingo Molnar <mingo@elte.hu> wrote: > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > Btw, the "enqueue at the end" could easily be a statistical thing, not > > an absolute thing. So it's possible that we could perhaps implement > > the CFS "yield()" using the same logic as we have now, except *not* > > calling the "update_stats()" stuff: > > > > __dequeue_entity(..); > > __enqueue_entity(..); > > > > and then just force the "fair_key" of the to something that > > *statistically* means that it should be at the end of its nice queue. > > > > I dunno. > > i thought a bit about the statistical approach, and it's good in > principle, but it has an implementational problem/complication: if there > are only yielding tasks in the system, then the "queue rightwards in the > tree, statistically" approach cycles through the key-space artificially > fast. That can cause various problems. (this means that the > workload-flag patch that uses yield_granularity is buggy as well. The > queue-rightmost patch did not have this problem.) > > So right now there are only two viable options i think: either do the > current weak thing, or do the rightmost thing. The statistical method > might work too, but it needs more thought and more testing - i'm not > sure we can get that ready for 2.6.23. > > So what we have as working code right now is the two extremes, and apps > will really mostly prefer either the first (if they dont truly want to > use yield but somehow it got into their code) or the second (if they > want some massive delay). So while it does not have a good QoI, how > about doing a compat_yield sysctl that allows the turning on of the > "queue rightmost" logic? Find tested patch below. > > Peter, what do you think? I have to agree that for .23 we can't do much more than this. And tasks moving to the right without actually doing work and advancing fair_clock do scare me a little. Also, while I agree with Linus' definition of sched_yield, I'm afraid it will cause 'regressions' for all the interactivity people out there. Somehow this yield thing has made it into all sorts of unfortunate places like video drivers, so a heavy penalizing yield will hurt them. -
From: Ingo Molnar Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] Date: Sep 19, 2:49 pm 2007 * Ingo Molnar <mingo@elte.hu> wrote: > Peter, what do you think? > > Linus, if this would be acceptable for .23 then i'll push it out into > sched.git together with another fix that Hiroshi Shimamoto just posted > to lkml. it's getting late here so i've pushed the current version of those two patches out to: git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git I'll redo this git tree if we want some other solution for yield. (but i think this is the safest approach for 2.6.23 - some apps will complain about too strong yield, some apps will complain about too weak yield. So by providing the two extremes we at least cover the practical range of behavior.) there's nothing else pending for 2.6.23 otherwise at the moment, scheduler-wise. Ingo ------------------> Hiroshi Shimamoto (1): sched: fix invalid sched_class use Ingo Molnar (1): sched: add /proc/sys/kernel/sched_compat_yield include/linux/sched.h | 1 kernel/sched.c | 10 ++++--- kernel/sched_fair.c | 63 +++++++++++++++++++++++++++++++++++++++++++++----- kernel/sysctl.c | 8 ++++++ 4 files changed, 72 insertions(+), 10 deletions(-) -


Bad interfaces indeed.
Ingo's right that sched_yield is not a good interface for a scheduler that is not strict FIFO or round-robin style. Unfortunately, POSIX requires that this system call exist, so there's no helping supporting it somehow.
The spec says that it should be pushed to the end of its "queue", which itself makes certain assumptions about the internals of the scheduler, assumptions which CFS (and Con's SD, for that matter) do not meet. That itself is a problem with the spec, not the scheduler. Still, the "push it to the last place in the tree" idea does at least fit the notion of what the spec is talking about. Sorta.
I do think userspace programs should not use sched_yield unless they're running realtime, though. Call sleep instead, that has much better-defined semantics, and doesn't make assumptions about scheduler internals matching the classic Unix scheduler, that a more modern scheduler could well violate.
(sched_yield()) should move it *last*
I don't understand why Linus insists on this when there is no obvious practical reason to do so - especially since he has a history of saying things to the effect "why implement something required by a standard if it's stupid". If the POSIX spec specifies it should be last on the (active) queue, we need to try to figure out the reason that POSIX required that. Could it be a mistake that made it into the standard so the standard dictates policy? If there is a compelling and contemporary reason to enforce the POSIX behavior, then it should be done. If not, then don't bother - instead work for an amendment of the appropriate part of the POSIX spec. The computing world does evolve and if we blindly adhered to standards rather than updating them, we'd be stuck in the early 1970s.
JVMs
Look at the subject line of the discussion: some Java VMs use it and Linus is a pragmatist. His point boils down to: "apps use it thus we have to implement it, so lets make it as good of an implementation as possible".
In -rc7 he has merged Ingo's patch that adds a sysctl flag for this.
moving process to last
I was looking at the sample code in the bug report - that CFS behavior is seriously broken. A small loop, common in many polling tasks, taking up 50% of the CPU allocation because it is usually put before the other task. It will be interesting to see how Ingo works this one out.
easy implementation
why not implement sched_yield() as a *temporary* nice(5), until the next time it is run ?
That would achieve the intended purpose of "yielding" some of its natural priority, without being confined to a low priority forever.
With the CFS, if a nice(0) task is entitled to, say, 10% of the CPU time, and yields, it would be temporarily listed at entitled, say, 2% of the CPU, allowing other tasks to run more, until it is ran and presumably yields() again.
Or, a yielding task could, temprarily, halve its nice level. The yields() could even be made cumulative, so a continually yielding task would end up at nice(19), but rebounds to its normal priority as soon as it ceases to yield.
The yield API sucks: there's
The yield API sucks: there's no "when it's over" event, it's a single syscall with no other information provided to the kernel.
The API would be much saner if it consisted of a pair of syscalls, one of which enabled task X to say: "user space could not get its lock, it's waiting on task Y now", the other syscall said "user space got the lock, not waiting on Y anymore". (with a third one as well that allowed task Y to tell the kernel that X now got the lock and can run again)
Incidentally, the futex syscall's functionality is quite close to the one described above, so it's the natural replacement for yield calls.
Hmm... "rightmost at nice level"?
For the "rightmost at nice level", couldn't you track at enqueue time what the highest "fair_key" is at each nice level? Then you could do this calculation:
/* code to find "rightmost" snipped */ /* * Minimally necessary key value to be last in the tree: */ se->fair_key = rightmost->fair_key + 1;using the fair_key of the rightmost task at that nice level or all nice levels above. That is, you want to queue to the right of all tasks at the same or lower nice level than you. So if you yield, do:
rightmost_fair_key = se->fair_key + (int)sysctl_sched_yield_granularity; for (i = se->nice; i >= -20; i--) if (rightmost_fair_key < rightmost_by_nice[i + 20]) rightmost_fair_key = rightmost_by_nice[i + 20]; se->fair_key = rightmost_fair_key + 1;Make sense? Sure, you scan up to 40 ints whenever you sched_yield(). But assuming you're yielding for tasks that actually do something, this doesn't cost much at all since it happens infrequently.
(P.S. I don't know if se actually has a "nice" field. Consider this pseudocode. Presumably the nice value, or a shifted variant of it, is available somewhere in the struct. Also, was confused for a sec as to which end got 20 and which end got 19. Turns out the nice range is from +19 to -20.)
--
Program Intellivision and play Space Patrol!
For the "rightmost at nice
I guess you could, but I think the idea here is to implement sched_yield() without adding any overhead to the fast path. Any tracking would have to go into the fast path.
Later in the discussion Ingo describes how the old scheduler implemented it, and there too the idea was to do it without any extra code in the common case.
That approach makes sense to me: sched_yield() is a broken (and rarely used) interface, so there's no "prime time" support for it in code or data structures. But even under those constraints, the highest possible quality of implementation is strived for.
Mis-interpretation of yield()
The existence of "yield" is for one thing, and one thing only: a hint to the scheduler that the current thread cannot make forward progress until some other (equal priority) thread does something. That is, it's a scheduler hint that the current thread is busy-waiting on some other thread (of equal or greater priority).
Unfortunately, Linus & other kernel devs are wrong in believing the correct way to handle this problem is with blocking events. There are cases where busy-waiting is preferable to blocking events. The cases are (1) when two threads are somehow unable to share a synchronization object (maybe one thread cannot block, e.g. server thread or kernel thread), or (2) when the cost of a synchronization object (context switch to kernel, block, unblock, context switch back) is too high. My job works in areas where these two cases are true quite often, and Linux's always-changing semantics here are a severe source of problems.
The best example here is a graphics driver: the graphics driver cannot afford the scheduling overhead of blocking every time it finishes processing data, it is far more efficient to busy-wait until more data arrives. (Efficient in this case meaning lower latency to screen, at the cost of of CPU cycles). If a graphics driver has no work to do, but cannot afford to block, what is it to do? Worse, if this is a UP machine, the graphics card will get the CPU, knows it can't possibly do anything else in the time slice, still must busy-wait ... sched_yield is an important call here.
"The best example here is a
"The best example here is a graphics driver: the graphics driver cannot afford the scheduling overhead of blocking every time it finishes processing data, it is far more efficient to busy-wait until more data arrives. (Efficient in this case meaning lower latency to screen, at the cost of of CPU cycles."
You stay the hell away from my laptop!
"You stay the hell away from
"You stay the hell away from my laptop!"
If you don't want high-performance 3D, then don't run it on your laptop - 3D is latency-sensitive, 2D isn't. You only need busy-wait when somebody needs that extra bit of performance. (Graphics drivers only need to busy-wait when they compete with another app - e.g. a game - for CPU time. It's better to run at the same priority and share CPU time than to jar the scheduler back and forth with different priorities.)
Maybe I missed something...
Explain to me again how sched_yield() doesn't do a context switch to the kernel and back? Kinda curious on that one.
--
Program Intellivision and play Space Patrol!
The Xorg 3D devs disagree with you
The Xorg devs disagree with you, they have gradually moved away from the early, naive, polling-based lock implementation, towards an implementation that will also block when needed.
Modern 3D hardware has conveniently deep 3D command queues, so nothing keeps the Xorg devs from queueing up tons of commands and then going to sleep for completion. The deeper the queue, the fewer interrupts, wakeups and context switches happen. Reliance on sched_yield() for performance is a relict of the past.
Thank goodness it is the Xorg devs and not you who is designing the 3D software for my desktop and my laptop. A blocking event plus wake up costs a microsecond - and you'd rather poll for many milliseconds (even seconds), to "save" less than a microsecond?
Besides wasting CPU (and battery) power, sched_yield() also ruins interactivity and hurts 3D performance under any kind of non-zero load: what happens if the 3D operation completes just after the sched_yield() has been done and the task has been queued after a gazillion make jobs? Right: hefty delay in your 3D app. Or receive a mail and have an anti-spam script run while your 3D app is running? Right, hefty delay in your 3D app.
Tellingly, it is the NVidia closed-source driver that still insists on a sched_yield() based 3D lock implementation. The sane OSS 3D drivers have moved away from that model already. That is what open review gives us.
This matches my expectation. Thanks.
One question, since I don't program these queues: Presumably, there's a low-water-mark that triggers the interrupt for these queues?
As I see it, a proper command queue has a queue-low interrupt that gets armed if the queue fills above a high watermark, and later drops below a low watermark. Separate of that is a queue-empty interrupt. The idea is the queue-low interrupt wakes the filling task before the queue empties, so that the 3-D hardware never idles, but only does this if there seems to be a lot of work to do. If that interrupt gets dropped, or there wasn't enough work to do to fill above the "lots of work" threshold, the queue-empty interrupt just serves as a "I'm done" notifier.
How do the 3-D command queues work? Is it anything like that?
--
Program Intellivision and play Space Patrol!
Relinquishing timeslice
I thought yield was about relinquishing the current remaining timeslice.
Can't yield just say "this task was supposed to run for x more time, it now yields, so I'll consider that it actually used that time"?
So virtually, CFS would allocate the CPU 50-50 with another task of the same priority. But it reality, it would be less balanced. The level of balance will depend on how much time was already used when the task yielded. A task that yields immediately, will produce a balance of pretty much 0-100. A task that yield at the end of its timeslice will produce a balance of 50-50.
I think there are several
I think there are several uses of yield(), each user having his own expectation about yield() behavior.
Someone mentionned the use of yield() for active polling.
I understand several benchmark utilities also use yield(), what they do is akin of polling, they collect a few statistics, then yield().
Yield() could mean "don't run me for the next nn milliseconds -- but I expect to be allowed to run soon after that".
As I understand it, CFS does not model very well that family of tasks: Tasks that expect to be run every few milliseconds, but which (usually) have only a trivial amount of work to do. (And are unwilling to use timer facilities).
If those tasks were CFS-aware, they would self-nice themselves, a low (high) nice priority level that would guarantee them say, 2%, of the CPU time (depending on load).
That's the problem with yelding tasks, they want to be run "often", usually for a very short period of time (after which they would yield() again), but they can suddenly morph back into regular tasks, when their polling/yielding phase is over.
I wonder if an additional yield() parameter could provide the scheduler with an interval in which NOT to run them. As the CFS operate, those tasks would likely to be scheduled to run next. But then that's not different from waiting from a short interval timer.