logo
Published on KernelTrap (http://kerneltrap.org)

CFS and sched_yield

By Jeremy
Created Sep 21 2007 - 16:23

"sched_yield() is not - and should not be - about 'recalculating the position in the scheduler queue' like you do now in CFS," Linus Torvalds stated in a discussion [1] with Completely Fair Scheduler author Ingo Molnar, pointing to the man pages to back up his argument that sched_yield should instead move a thread to the end of its queue, adding, "quite frankly, the current CFS behaviour simply looks buggy. It should simply not move it to the 'right place' in the rbtree. It should move it *last*."

Ingo described how it worked with the pre-2.6.23 scheduler, "the O(1) implementation of yield() was pretty arbitrary: it did not move it last on the same priority level - it only did it within the active array. So expired tasks (such as CPU hogs) would come _after_ a yield()-ing task." He went on to compare this to the new process scheduler , "so the yield() implementation was so much tied to the data structures of the O(1) scheduler that it was impossible to fully emulate it in CFS. In CFS we dont have a per-nice-level rbtree, so we cannot move it dead last within the same priority group - but we can move it dead last in the whole tree. (then they'd be put even after nice +19 tasks.) People might complain about _that_." He also noted that this would change the behavior for some desktop applications that call sched_yield() [2], "there will be lots of regression reports about lost interactivity during load."

Linus replied, "the sched_yield() behaviour is actually very well-defined for RT tasks (now, whether it's a good interface to use or not is still open to debate, but at least it's a _defined_ interface ;), and I think we should at least try to approximate that behaviour for normal tasks, even if they aren't RT." He added:

"So I don't think a 'globally last' option is at all the best thing, but I think it's likely better than what CFS does now. And if we then have some agreement on what would be considered a further _improvement_, then that would be a good thing - maybe we're not perfect, but at least having a view of what we want to do is a good idea."

Linus then suggested, "the 'enqueue at the end' could easily be a statistical thing, not an absolute thing."

Ingo agreed this was a possible long term goal, "I thought a bit about the statistical approach, and it's good in principle, but it has an implementational problem/complication: if there are only yielding tasks in the system, then the 'queue rightwards in the tree, statistically' approach cycles through the key-space artificially fast. That can cause various problems." He then summarized:

"So right now there are only two viable options i think: either do the current weak thing, or do the rightmost thing. The statistical method might work too, but it needs more thought and more testing - I'm not sure we can get that ready for 2.6.23.

"So what we have as working code right now is the two extremes, and apps will really mostly prefer either the first (if they don't truly want to use yield but somehow it got into their code) or the second (if they want some massive delay). So while it does not have a good QoI, how about doing a compat_yield sysctl that allows the turning on of the 'queue rightmost' logic?"

Ultimately, the suggested sysctl was merged into the mainline kernel [3].


From: Chuck Ebbert <cebbert@...>
Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]
 [3]Date: Sep 18, 7:02 pm 2007

On 09/18/2007 06:46 PM, Ingo Molnar wrote:
>>> We need a (tested) 
>>> solution for 2.6.23 and the CFS-devel patches are not for 2.6.23. I've 
>>> attached below the latest version of the -rc6 yield patch - the switch 
>>> is not dependent on SCHED_DEBUG anymore but always available.
>>>
>> Is this going to be merged? And will you be making the default == 1 or 
>> just leaving it at 0, which forces people who want the older behavior 
>> to modify the default?
> 
> not at the moment - Antoine suggested that the workload is probably fine 
> and the patch against -rc6 would have no clear effect anyway so we have 
> nothing to merge right now. (Note that there's no "older behavior" 
> possible, unless we want to emulate all of the O(1) scheduler's 
> behavior.) But ... we could still merge something like that patch, but a 
> clearer testcase is needed. The JVM's i have access to work fine.

I just got a bug report today:

https://bugzilla.redhat.com/show_bug.cgi?id=295071 [4]

==================================================

Description of problem:

The CFS scheduler does not seem to implement sched_yield correctly. If one
program loops with a sched_yield and another program prints out timing
information in a loop. You will see that if both are taskset to the same core
that the timing stats will be twice as long as when they are on different cores.
This problem was not in 2.6.21-1.3194 but showed up in 2.6.22.4-65 and continues
in the newest released kernel 2.6.22.5-76. 

Version-Release number of selected component (if applicable):

2.6.22.4-65 through 2.6.22.5-76

How reproducible:

Very

Steps to Reproduce:
compile task1
int main() {
        while (1) {
            sched_yield();
        }
        return 0;
}

and compile task2

#include <stdio.h>
#include <sys/time.h>
int main() {
    while (1) {
        int i;
        struct timeval t0,t1;
        double usec;

        gettimeofday(&t0, 0);
        for (i = 0; i < 100000000; ++i)
            ;
        gettimeofday(&t1, 0);

        usec = (t1.tv_sec * 1e6 + t1.tv_usec) - (t0.tv_sec * 1e6 + t0.tv_usec);
        printf ("%8.0f\n", usec);
    }
    return 0;
}

Then run:
"taskset -c 0 ./task1"
"taskset -c 0 ./task2"

You will see that both tasks use 50% of the CPU. 
Then kill task2 and run:
"taskset -c 1 ./task2"

Now task2 will run twice as fast verifying that it is not some anomaly with the
way top calculates CPU usage with sched_yield.
  
Actual results:
Tasks with sched_yield do not yield like they are suppose to.

Expected results:
The sched_yield task's CPU usage should go to near 0% when another task is on
the same CPU.
-

From: Ingo Molnar <mingo@...> Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] [4]Date: Sep 19, 3:18 pm 2007 * Chuck Ebbert <cebbert@redhat.com> wrote: > I just got a bug report today: > > https://bugzilla.redhat.com/show_bug.cgi?id=295071 [5] > > ================================================== > > Description of problem: > > The CFS scheduler does not seem to implement sched_yield correctly. If > one program loops with a sched_yield and another program prints out > timing information in a loop. You will see that if both are taskset to > the same core that the timing stats will be twice as long as when they > are on different cores. This problem was not in 2.6.21-1.3194 but > showed up in 2.6.22.4-65 and continues in the newest released kernel > 2.6.22.5-76. sched_yield is a very poorly defined interface as it does not tell the kernel anything about what kind of locking user-space does. When an app calls the syscall it basically tells the scheduler: "uhm, ok, i'm blocked and i want you to do something else now. I dont want to tell you how long this task wants to wait, and i dont want to tell you on what condition it should be unblocked. Just ... do some stuff and let me alone. See ya." That's not an intelligent interface upon which the scheduler can do anything particularly intelligent (in fact, it's a very stupid interface upon which the scheduler can only do stupid things), and hence schedulers tend to implement sched_yield() quite arbitrarily and in a very scheduler-internals dependent way - which internals change when the scheduler is changed materially. The correct way to tell the kernel that the task is blocked is to use futexes for example, or any kernel-based locking or wait object - there are myriads of APIs for these. (The only well-defined behavior of yield is for SCHED_FIFO/RR tasks - and that is fully preserved in CFS.) Regarding compatible behavior: it's not possible to fully emulate the O(1) scheduler's yield behavior, because the yield implementation depended on so many scheduler internals. We changed yield when we went from 2.4 to 2.6, and even in 2.6 we changed it a number of times. To avoid the reoccuring problems of applications mistakenly relying on sched_yield(), we now context-switch on yield very weakly for SCHED_OTHER tasks (SCHED_FIFO/RR behavior is unchanged) - this is the only sane behavior that will get apps to stop using sched_yield() - and that's the difference that the above testcase shows. (there's actual user-space code executed, instead of the frequent overscheduling.) My patch below adds a sysctl flag that triggers a context-switch when yield is called (how futile and undefined and broken that might be), but that would only be for legacy reasons. We could still add this patch, but i was reluctant to send it to Linus without having at least one application that relies on having it and that benefits from it (Antoine's test turned out to not rely on it - see the '[FIXED]' qualifier in the subject line). Linus, what do you think? I have no strong feelings, I think the patch cannot hurt (it does not change anything by default) - but we should not turn the workaround flag on by default. If you agree that we should do this, then please pull this single patch from the sched.git tree: git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git i've tested it with the flag off and on as well, and it works as expected. Ingo ------------------> Ingo Molnar (1): sched: yield workaround include/linux/sched.h | 2 + kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++----- kernel/sysctl.c | 19 +++++++++++++ 3 files changed, 88 insertions(+), 6 deletions(-) Ingo ------------------------------> Subject: sched: yield workaround From: Ingo Molnar <mingo@elte.hu> sched_yield() is fundamentally broken, and CFS has changed its behavior. Some apps that mistakenly rely on sched_yield() want "weak" sched yield (such as desktop apps) - while some apps want "strong" sched_yield() (for example some JDKs). There's no way for the scheduler to figure out which of the two variants the app really wants - because sched_yield() is all about hiding from the kernel the true structure of the user-space locking code. As a solution, provide a workaround, to introduce a more agressive sched_yield implementation: # default one: echo 0 > /proc/sys/kernel/sched_yield_bug_workaround # always queues the current task next to the next task: echo 1 > /proc/sys/kernel/sched_yield_bug_workaround # NOP: echo 2 > /proc/sys/kernel/sched_yield_bug_workaround in the future, the use of this sysctl might generate a deprecation warning, so that apps start moving away from their reliance on sched_yield(). Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 2 + kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++----- kernel/sysctl.c | 19 +++++++++++++ 3 files changed, 88 insertions(+), 6 deletions(-) Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -1402,10 +1402,12 @@ extern void sched_idle_next(void); extern unsigned int sysctl_sched_latency; extern unsigned int sysctl_sched_min_granularity; +extern unsigned int sysctl_sched_yield_granularity; extern unsigned int sysctl_sched_wakeup_granularity; extern unsigned int sysctl_sched_batch_wakeup_granularity; extern unsigned int sysctl_sched_stat_granularity; extern unsigned int sysctl_sched_runtime_limit; +extern unsigned int sysctl_sched_yield_bug_workaround; extern unsigned int sysctl_sched_child_runs_first; extern unsigned int sysctl_sched_features; Index: linux/kernel/sched_fair.c =================================================================== --- linux.orig/kernel/sched_fair.c +++ linux/kernel/sched_fair.c @@ -42,6 +42,16 @@ unsigned int sysctl_sched_latency __read */ unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL; +unsigned int sysctl_sched_yield_granularity __read_mostly = 10000000ULL; + +/* + * sys_sched_yield workaround switch. + * + * This option switches the yield implementation of the + * old scheduler back on. + */ +unsigned int __read_mostly sysctl_sched_yield_bug_workaround; + /* * SCHED_BATCH wake-up granularity. * (default: 25 msec, units: nanoseconds) @@ -901,15 +911,66 @@ static void dequeue_task_fair(struct rq */ static void yield_task_fair(struct rq *rq, struct task_struct *p) { - struct cfs_rq *cfs_rq = task_cfs_rq(p); + if (!sysctl_sched_yield_bug_workaround) { + struct cfs_rq *cfs_rq = task_cfs_rq(p); + __update_rq_clock(rq); + + /* + * Dequeue and enqueue the task to update its + * position within the tree: + */ + dequeue_entity(cfs_rq, &p->se, 0); + enqueue_entity(cfs_rq, &p->se, 0); + return; + } + + if (sysctl_sched_yield_bug_workaround == 1) { + struct cfs_rq *cfs_rq = task_cfs_rq(p); + struct rb_node *curr, *next, *first; + struct task_struct *p_next; + s64 yield_key; + + __update_rq_clock(rq); + curr = &p->se.run_node; + first = first_fair(cfs_rq); + /* + * Move this task to the second place in the tree: + */ + if (curr != first) + next = rb_next(curr); + else + next = first; + /* + * We were the last one already - nothing to do, return + * and reschedule: + */ + if (unlikely(!next)) + return; + + p_next = rb_entry(next, struct task_struct, se.run_node); + /* + * Minimally necessary key value to be the second in the tree: + */ + yield_key = p_next->se.fair_key + + (int)sysctl_sched_yield_granularity; + + dequeue_entity(cfs_rq, &p->se, 0); + + /* + * Only update the key if we need to move more backwards + * than the minimally necessary position to be the second: + */ + if (p->se.fair_key < yield_key) + p->se.fair_key = yield_key; + + __enqueue_entity(cfs_rq, &p->se); + return; + } - __update_rq_clock(rq); /* - * Dequeue and enqueue the task to update its - * position within the tree: + * Just reschedule, do nothing else: */ - dequeue_entity(cfs_rq, &p->se, 0); - enqueue_entity(cfs_rq, &p->se, 0); + resched_task(p); } /* Index: linux/kernel/sysctl.c =================================================================== --- linux.orig/kernel/sysctl.c +++ linux/kernel/sysctl.c @@ -244,6 +244,17 @@ static ctl_table kern_table[] = { }, { .ctl_name = CTL_UNNUMBERED, + .procname = "sched_yield_granularity_ns", + .data = &sysctl_sched_yield_granularity, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec_minmax, + .strategy = &sysctl_intvec, + .extra1 = &min_sched_granularity_ns, + .extra2 = &max_sched_granularity_ns, + }, + { + .ctl_name = CTL_UNNUMBERED, .procname = "sched_wakeup_granularity_ns", .data = &sysctl_sched_wakeup_granularity, .maxlen = sizeof(unsigned int), @@ -303,6 +314,14 @@ static ctl_table kern_table[] = { .proc_handler = &proc_dointvec, }, #endif + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_yield_bug_workaround", + .data = &sysctl_sched_yield_bug_workaround, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, #ifdef CONFIG_PROVE_LOCKING { .ctl_name = CTL_UNNUMBERED, -
From: Linus Torvalds <torvalds@...> Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] [5]Date: Sep 19, 3:39 pm 2007 On Wed, 19 Sep 2007, Ingo Molnar wrote: > > Linus, what do you think? I have no strong feelings, I think the patch > cannot hurt (it does not change anything by default) - but we should not > turn the workaround flag on by default. I disagree. I think CFS made "sched_yield()" worse, and what you call "bug workaround" is likely the *better* behaviour. The fact is, sched_yield() is not - and should not be - about "recalculating the position in the scheduler queue" like you do now in CFS. It very much is about moving the thread *dead last* within its priority group. That's what it does for round-robin, and it's not about fairness, it's about - Opengroup: DESCRIPTION The sched_yield() function forces the running thread to relinquish the processor until it again becomes the head of its thread list. It takes no arguments. - Linux man-page: DESCRIPTION A process can relinquish the processor voluntarily without blocking by calling sched_yield. The process will then be moved to the end of the queue for its static priority and a new process gets to run. and quite frankly, the current CFS behaviour simply looks buggy. It should simply not move it to the "right place" in the rbtree. It should move it *last*. Linus -
From: Ingo Molnar <mingo@...> Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] [5]Date: Sep 19, 3:56 pm 2007 * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Linus, what do you think? I have no strong feelings, I think the > > patch cannot hurt (it does not change anything by default) - but we > > should not turn the workaround flag on by default. > > I disagree. I think CFS made "sched_yield()" worse, and what you call > "bug workaround" is likely the *better* behaviour. > > The fact is, sched_yield() is not - and should not be - about > "recalculating the position in the scheduler queue" like you do now in > CFS. > > It very much is about moving the thread *dead last* within its > priority group. [...] > and quite frankly, the current CFS behaviour simply looks buggy. It > should simply not move it to the "right place" in the rbtree. It > should move it *last*. ok, we can do that. the O(1) implementation of yield() was pretty arbitrary: it did not move it last on the same priority level - it only did it within the active array. So expired tasks (such as CPU hogs) would come _after_ a yield()-ing task. so the yield() implementation was so much tied to the data structures of the O(1) scheduler that it was impossible to fully emulate it in CFS. in CFS we dont have a per-nice-level rbtree, so we cannot move it dead last within the same priority group - but we can move it dead last in the whole tree. (then they'd be put even after nice +19 tasks.) People might complain about _that_. another practical problem is that this will break certain desktop apps that do calls to yield() [some firefox plugins do that, some 3D apps do that, etc.] but they dont expect to be moved 'very late' into the queue - they expect the O(1) scheduler's behavior of being delayed "a bit". (That's why i added the yield-granularity tunable.) we can make yield super-agressive, that is pretty much the only sane (because well-defined) thing to do (besides turning yield into a NOP), but there will be lots of regression reports about lost interactivity during load. sched_yield() is a mortally broken API. "fix the app" would be the answer, but still there will be lots of complaints. Ingo -
From: Linus Torvalds <torvalds@...> Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] [5]Date: Sep 19, 4:28 pm 2007 On Wed, 19 Sep 2007, Ingo Molnar wrote: > > in CFS we dont have a per-nice-level rbtree, so we cannot move it dead > last within the same priority group - but we can move it dead last in > the whole tree. (then they'd be put even after nice +19 tasks.) People > might complain about _that_. Yeah, with reasonably good reason. The sched_yield() behaviour is actually very well-defined for RT tasks (now, whether it's a good interface to use or not is still open to debate, but at least it's a _defined_ interface ;), and I think we should at least try to approximate that behaviour for normal tasks, even if they aren't RT. And the closest you can get is basically to say something that is close to "sched_yield()" on a non-RT process still means that all other runnable tasks in that same nice-level will be scheduled before the task is scheduled again. and I think that would be the optimal approximation of the RT behaviour. Now, quite understandably we might not be able to actually get that *exact* behaviour (since we mix up different nice levels), but I think from a QoI standpoint we should really have that as a "target" behaviour. So I think it should be moved back in the RB-tree, but really preferably only back to behind all other processes at the same nice-level. So I think we have two problems with yield(): - it really doesn't have very nice/good semantics in the first place for anything but strict round-robin RT tasks. We can't do much about this problem, apart from trying to make people use proper locking and other models that do *not* depend on yield(). - Linux has made the problem a bit worse by then having fairly arbitrary and different semantics over time. This is where I think it would be a good idea to try to have the above kind of "this is the ideal behaviour - we don't guarantee it being precise, but at least we'd have some definition of what people should be able to expect is the "best" behaviour. So I don't think a "globally last" option is at all the best thing, but I think it's likely better than what CFS does now. And if we then have some agreement on what would be considered a further _improvement_, then that would be a good thing - maybe we're not perfect, but at least having a view of what we want to do is a good idea. Btw, the "enqueue at the end" could easily be a statistical thing, not an absolute thing. So it's possible that we could perhaps implement the CFS "yield()" using the same logic as we have now, except *not* calling the "update_stats()" stuff: __dequeue_entity(..); __enqueue_entity(..); and then just force the "fair_key" of the to something that *statistically* means that it should be at the end of its nice queue. I dunno. Linus -
From: Ingo Molnar <mingo@...> Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] [5]Date: Sep 19, 5:41 pm 2007 * Linus Torvalds <torvalds@linux-foundation.org> wrote: > Btw, the "enqueue at the end" could easily be a statistical thing, not > an absolute thing. So it's possible that we could perhaps implement > the CFS "yield()" using the same logic as we have now, except *not* > calling the "update_stats()" stuff: > > __dequeue_entity(..); > __enqueue_entity(..); > > and then just force the "fair_key" of the to something that > *statistically* means that it should be at the end of its nice queue. > > I dunno. i thought a bit about the statistical approach, and it's good in principle, but it has an implementational problem/complication: if there are only yielding tasks in the system, then the "queue rightwards in the tree, statistically" approach cycles through the key-space artificially fast. That can cause various problems. (this means that the workload-flag patch that uses yield_granularity is buggy as well. The queue-rightmost patch did not have this problem.) So right now there are only two viable options i think: either do the current weak thing, or do the rightmost thing. The statistical method might work too, but it needs more thought and more testing - i'm not sure we can get that ready for 2.6.23. So what we have as working code right now is the two extremes, and apps will really mostly prefer either the first (if they dont truly want to use yield but somehow it got into their code) or the second (if they want some massive delay). So while it does not have a good QoI, how about doing a compat_yield sysctl that allows the turning on of the "queue rightmost" logic? Find tested patch below. Peter, what do you think? Linus, if this would be acceptable for .23 then i'll push it out into sched.git together with another fix that Hiroshi Shimamoto just posted to lkml. Ingo --------------------------------------> Subject: sched: add /proc/sys/kernel/sched_compat_yield From: Ingo Molnar <mingo@elte.hu> add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield() more agressive, by moving the yielding task to the last position in the rbtree. with sched_compat_yield=0: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield 2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop with sched_compat_yield=1: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 loop 2582 mingo 20 0 1576 256 204 R 0 0.0 0:00.00 loop_yield Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 1 kernel/sched.c | 5 --- kernel/sched_fair.c | 63 +++++++++++++++++++++++++++++++++++++++++++++----- kernel/sysctl.c | 8 ++++++ 4 files changed, 67 insertions(+), 10 deletions(-) Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -1406,6 +1406,7 @@ extern unsigned int sysctl_sched_wakeup_ extern unsigned int sysctl_sched_batch_wakeup_granularity; extern unsigned int sysctl_sched_stat_granularity; extern unsigned int sysctl_sched_runtime_limit; +extern unsigned int sysctl_sched_compat_yield; extern unsigned int sysctl_sched_child_runs_first; extern unsigned int sysctl_sched_features; Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -4550,10 +4550,7 @@ asmlinkage long sys_sched_yield(void) struct rq *rq = this_rq_lock(); schedstat_inc(rq, yld_cnt); - if (unlikely(rq->nr_running == 1)) - schedstat_inc(rq, yld_act_empty); - else - current->sched_class->yield_task(rq, current); + current->sched_class->yield_task(rq, current); /* * Since we are going to call schedule() anyway, there's Index: linux/kernel/sched_fair.c =================================================================== --- linux.orig/kernel/sched_fair.c +++ linux/kernel/sched_fair.c @@ -43,6 +43,14 @@ unsigned int sysctl_sched_latency __read unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL; /* + * sys_sched_yield() compat mode + * + * This option switches the agressive yield implementation of the + * old scheduler back on. + */ +unsigned int __read_mostly sysctl_sched_compat_yield; + +/* * SCHED_BATCH wake-up granularity. * (default: 25 msec, units: nanoseconds) * @@ -897,19 +905,62 @@ static void dequeue_task_fair(struct rq } /* - * sched_yield() support is very simple - we dequeue and enqueue + * sched_yield() support is very simple - we dequeue and enqueue. + * + * If compat_yield is turned on then we requeue to the end of the tree. */ static void yield_task_fair(struct rq *rq, struct task_struct *p) { struct cfs_rq *cfs_rq = task_cfs_rq(p); + struct rb_node **link = &cfs_rq->tasks_timeline.rb_node; + struct sched_entity *rightmost, *se = &p->se; + struct rb_node *parent; - __update_rq_clock(rq); /* - * Dequeue and enqueue the task to update its - * position within the tree: + * Are we the only task in the tree? + */ + if (unlikely(cfs_rq->nr_running == 1)) + return; + + if (likely(!sysctl_sched_compat_yield)) { + __update_rq_clock(rq); + /* + * Dequeue and enqueue the task to update its + * position within the tree: + */ + dequeue_entity(cfs_rq, &p->se, 0); + enqueue_entity(cfs_rq, &p->se, 0); + + return; + } + /* + * Find the rightmost entry in the rbtree: */ - dequeue_entity(cfs_rq, &p->se, 0); - enqueue_entity(cfs_rq, &p->se, 0); + do { + parent = *link; + link = &parent->rb_right; + } while (*link); + + rightmost = rb_entry(parent, struct sched_entity, run_node); + /* + * Already in the rightmost position? + */ + if (unlikely(rightmost == se)) + return; + + /* + * Minimally necessary key value to be last in the tree: + */ + se->fair_key = rightmost->fair_key + 1; + + if (cfs_rq->rb_leftmost == &se->run_node) + cfs_rq->rb_leftmost = rb_next(&se->run_node); + /* + * Relink the task to the rightmost position: + */ + rb_erase(&se->run_node, &cfs_rq->tasks_timeline); + rb_link_node(&se->run_node, parent, link); + rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline); } /* Index: linux/kernel/sysctl.c =================================================================== --- linux.orig/kernel/sysctl.c +++ linux/kernel/sysctl.c @@ -303,6 +303,14 @@ static ctl_table kern_table[] = { .proc_handler = &proc_dointvec, }, #endif + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_compat_yield", + .data = &sysctl_sched_compat_yield, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, #ifdef CONFIG_PROVE_LOCKING { .ctl_name = CTL_UNNUMBERED, -
From: Peter Zijlstra <a.p.zijlstra@...> Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] [5]Date: Sep 19, 5:58 pm 2007 On Wed, 19 Sep 2007 23:41:05 +0200 Ingo Molnar <mingo@elte.hu> wrote: > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > Btw, the "enqueue at the end" could easily be a statistical thing, not > > an absolute thing. So it's possible that we could perhaps implement > > the CFS "yield()" using the same logic as we have now, except *not* > > calling the "update_stats()" stuff: > > > > __dequeue_entity(..); > > __enqueue_entity(..); > > > > and then just force the "fair_key" of the to something that > > *statistically* means that it should be at the end of its nice queue. > > > > I dunno. > > i thought a bit about the statistical approach, and it's good in > principle, but it has an implementational problem/complication: if there > are only yielding tasks in the system, then the "queue rightwards in the > tree, statistically" approach cycles through the key-space artificially > fast. That can cause various problems. (this means that the > workload-flag patch that uses yield_granularity is buggy as well. The > queue-rightmost patch did not have this problem.) > > So right now there are only two viable options i think: either do the > current weak thing, or do the rightmost thing. The statistical method > might work too, but it needs more thought and more testing - i'm not > sure we can get that ready for 2.6.23. > > So what we have as working code right now is the two extremes, and apps > will really mostly prefer either the first (if they dont truly want to > use yield but somehow it got into their code) or the second (if they > want some massive delay). So while it does not have a good QoI, how > about doing a compat_yield sysctl that allows the turning on of the > "queue rightmost" logic? Find tested patch below. > > Peter, what do you think? I have to agree that for .23 we can't do much more than this. And tasks moving to the right without actually doing work and advancing fair_clock do scare me a little. Also, while I agree with Linus' definition of sched_yield, I'm afraid it will cause 'regressions' for all the interactivity people out there. Somehow this yield thing has made it into all sorts of unfortunate places like video drivers, so a heavy penalizing yield will hurt them. -
From: Ingo Molnar <mingo@...> Subject: Re: CFS: some bad numbers with Java/database threading [FIXED] [5]Date: Sep 19, 5:49 pm 2007 * Ingo Molnar <mingo@elte.hu> wrote: > Peter, what do you think? > > Linus, if this would be acceptable for .23 then i'll push it out into > sched.git together with another fix that Hiroshi Shimamoto just posted > to lkml. it's getting late here so i've pushed the current version of those two patches out to: git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git I'll redo this git tree if we want some other solution for yield. (but i think this is the safest approach for 2.6.23 - some apps will complain about too strong yield, some apps will complain about too weak yield. So by providing the two extremes we at least cover the practical range of behavior.) there's nothing else pending for 2.6.23 otherwise at the moment, scheduler-wise. Ingo ------------------> Hiroshi Shimamoto (1): sched: fix invalid sched_class use Ingo Molnar (1): sched: add /proc/sys/kernel/sched_compat_yield include/linux/sched.h | 1 kernel/sched.c | 10 ++++--- kernel/sched_fair.c | 63 +++++++++++++++++++++++++++++++++++++++++++++----- kernel/sysctl.c | 8 ++++++ 4 files changed, 72 insertions(+), 10 deletions(-) -


Related links:


Source URL:
http://kerneltrap.org/Linux/CFS_and_sched_yield