* Linus Torvalds <torvalds@linux-foundation.org> wrote:i thought a bit about the statistical approach, and it's good in principle, but it has an implementational problem/complication: if there are only yielding tasks in the system, then the "queue rightwards in the tree, statistically" approach cycles through the key-space artificially fast. That can cause various problems. (this means that the workload-flag patch that uses yield_granularity is buggy as well. The queue-rightmost patch did not have this problem.) So right now there are only two viable options i think: either do the current weak thing, or do the rightmost thing. The statistical method might work too, but it needs more thought and more testing - i'm not sure we can get that ready for 2.6.23. So what we have as working code right now is the two extremes, and apps will really mostly prefer either the first (if they dont truly want to use yield but somehow it got into their code) or the second (if they want some massive delay). So while it does not have a good QoI, how about doing a compat_yield sysctl that allows the turning on of the "queue rightmost" logic? Find tested patch below. Peter, what do you think? Linus, if this would be acceptable for .23 then i'll push it out into sched.git together with another fix that Hiroshi Shimamoto just posted to lkml. Ingo --------------------------------------> Subject: sched: add /proc/sys/kernel/sched_compat_yield From: Ingo Molnar <mingo@elte.hu> add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield() more agressive, by moving the yielding task to the last position in the rbtree. with sched_compat_yield=0: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield 2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop with sched_compat_yield=1: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 loop 2582 mingo 20 0 1576 256 204 R 0 0.0 0:00.00 loop_yield Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 1 kernel/sched.c | 5 --- kernel/sched_fair.c | 63 +++++++++++++++++++++++++++++++++++++++++++++----- kernel/sysctl.c | 8 ++++++ 4 files changed, 67 insertions(+), 10 deletions(-) Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -1406,6 +1406,7 @@ extern unsigned int sysctl_sched_wakeup_ extern unsigned int sysctl_sched_batch_wakeup_granularity; extern unsigned int sysctl_sched_stat_granularity; extern unsigned int sysctl_sched_runtime_limit; +extern unsigned int sysctl_sched_compat_yield; extern unsigned int sysctl_sched_child_runs_first; extern unsigned int sysctl_sched_features; Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -4550,10 +4550,7 @@ asmlinkage long sys_sched_yield(void) struct rq *rq = this_rq_lock(); schedstat_inc(rq, yld_cnt); - if (unlikely(rq->nr_running == 1)) - schedstat_inc(rq, yld_act_empty); - else - current->sched_class->yield_task(rq, current); + current->sched_class->yield_task(rq, current); /* * Since we are going to call schedule() anyway, there's Index: linux/kernel/sched_fair.c =================================================================== --- linux.orig/kernel/sched_fair.c +++ linux/kernel/sched_fair.c @@ -43,6 +43,14 @@ unsigned int sysctl_sched_latency __read unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL; /* + * sys_sched_yield() compat mode + * + * This option switches the agressive yield implementation of the + * old scheduler back on. + */ +unsigned int __read_mostly sysctl_sched_compat_yield; + +/* * SCHED_BATCH wake-up granularity. * (default: 25 msec, units: nanoseconds) * @@ -897,19 +905,62 @@ static void dequeue_task_fair(struct rq } /* - * sched_yield() support is very simple - we dequeue and enqueue + * sched_yield() support is very simple - we dequeue and enqueue. + * + * If compat_yield is turned on then we requeue to the end of the tree. */ static void yield_task_fair(struct rq *rq, struct task_struct *p) { struct cfs_rq *cfs_rq = task_cfs_rq(p); + struct rb_node **link = &cfs_rq->tasks_timeline.rb_node; + struct sched_entity *rightmost, *se = &p->se; + struct rb_node *parent; - __update_rq_clock(rq); /* - * Dequeue and enqueue the task to update its - * position within the tree: + * Are we the only task in the tree? + */ + if (unlikely(cfs_rq->nr_running == 1)) + return; + + if (likely(!sysctl_sched_compat_yield)) { + __update_rq_clock(rq); + /* + * Dequeue and enqueue the task to update its + * position within the tree: + */ + dequeue_entity(cfs_rq, &p->se, 0); + enqueue_entity(cfs_rq, &p->se, 0); + + return; + } + /* + * Find the rightmost entry in the rbtree: */ - dequeue_entity(cfs_rq, &p->se, 0); - enqueue_entity(cfs_rq, &p->se, 0); + do { + parent = *link; + link = &parent->rb_right; + } while (*link); + + rightmost = rb_entry(parent, struct sched_entity, run_node); + /* + * Already in the rightmost position? + */ + if (unlikely(rightmost == se)) + return; + + /* + * Minimally necessary key value to be last in the tree: + */ + se->fair_key = rightmost->fair_key + 1; + + if (cfs_rq->rb_leftmost == &se->run_node) + cfs_rq->rb_leftmost = rb_next(&se->run_node); + /* + * Relink the task to the rightmost position: + */ + rb_erase(&se->run_node, &cfs_rq->tasks_timeline); + rb_link_node(&se->run_node, parent, link); + rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline); } /* Index: linux/kernel/sysctl.c =================================================================== --- linux.orig/kernel/sysctl.c +++ linux/kernel/sysctl.c @@ -303,6 +303,14 @@ static ctl_table kern_table[] = { .proc_handler = &proc_dointvec, }, #endif + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_compat_yield", + .data = &sysctl_sched_compat_yield, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, #ifdef CONFIG_PROVE_LOCKING { .ctl_name = CTL_UNNUMBERED, -
| Rafael J. Wysocki | [Bug #11355] Regression in 2.6.27-rc2 when cross-building the kernel |
| David Miller | [GIT]: Networking |
| Robin Lee Powell | NFS hang + umount -f: better behaviour requested. |
| Artem Bityutskiy | [PATCH 11/22 take 3] UBI: user-interfaces unit |
git: | |
| David Symonds | Re: I don't want the .git directory next to my code. |
| Petr Baudis | [ANNOUNCE] TopGit - A different patch queue manager |
| Daniel Barkalow | Non-http dumb protocols |
| Kevin Ballard | Re: git on MacOSX and files with decomposed utf-8 file names |
| Jim Winstead Jr. | Re: Root Disk/Book Disk Compatibility |
| Linus Benedict Torvalds | Trying to answer ... |
| Linus Torvalds | Re: mget / mput = bad director components |
| Gerard A. Allan | BUG REPORT --- call chsvga (really call getkey) |
| c l | site-to-site vpn 4.0 to cisco 3000 |
| Darren Spruell | Re: OpenBSD as Virtualbox guest |
| Richard Stallman | Real men don't attack straw men |
| GVG GVG | ssh_exchange_identification: Connection closed by remote host |
