Hi Peter, hi Mike, hi Ingo, we have observed some strangeness regarding the runqueue clock updates when porting the LITMUS^RT patches (http://www.cs.unc.edu/~anderson/litmus-rt) from .34 to .36. Specifically, it seems that the flag rq->skip_clock_update remains set for prolonged times in some circumstances. Not updating the runqueue clock while a task is scheduled obviously causes the scheduler to make badly misguided decisions. I was also able to reproduce this on the latest pre-.37 _without_ LITMUS^RT patches (see below), so I don't think that the problem was introduced by us. First, here's a debug trace that illustrates the problem. (The trace is from LITMUS^RT applied to .36 on a four-CPU host with copious amounts of extra debug info. The first number in the trace is a sequence number and can be The workload consisted of tmux, a couple of top instances, and compiling a library (liblitmus). On CPU 3, kworker invoked schedule() because it blocks. Its TIF_NEED_RESCHED is cleared. The idle load balancer is activated because The idle load balancer marked kworker with set_tsk_need_resched(), even though kworker was on its way out anyway. CPU 3 continues to schedule and Some time passed and CPU 3 was active servicing ld and a different kworker. The kworker that was marked with set_tsk_need_resched() resumes on CPU 0. Because the flag was already set, check_preempt_curr() sets rq->skip_clock_update() *on CPU 3*. CPU 3 never knows that anything happened and continues to schedule its currently assigned task (ld). Because skip_clock_update is set, the reference time for the scheduler does not advance anymore. Hence ld seems to be consuming no execution time and CFS schedules it until it blocks by itself (or until a SCHED_FIFO or SCHED_RT task becomes active). And so on... This seems very wrong and it causes massive problems for LITMUS^RT. As a workaround, never setting skip_clock_update in check_preempt_curr() gets rid of all symptoms. I was under the impression ...
Yes. Does the below fix it up for you?
Sched: clear_tsk_need_resched() after pull_task() when NEWIDLE balancing
pull_task() may call set_tsk_need_resched() on a deactivated task,
leaving it vulnerable to an inappropriate preemption after wakeup.
This also confuses the skip_clock_update logic, which assumes that
schedule() will be called in very short order after being set. Make
that logic more robust by clearing in update_rq_clock() itself, so
only one update can be skipped.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>
---
kernel/sched.c | 3 ++-
kernel/sched_fair.c | 10 ++++++++--
2 files changed, 10 insertions(+), 3 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -657,6 +657,8 @@ inline void update_rq_clock(struct rq *r
sched_irq_time_avg_update(rq, irq_time);
}
+
+ rq->skip_clock_update = 0;
}
/*
@@ -3714,7 +3716,6 @@ static void put_prev_task(struct rq *rq,
{
if (prev->se.on_rq)
update_rq_clock(rq);
- rq->skip_clock_update = 0;
prev->sched_class->put_prev_task(rq, prev);
}
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -2019,15 +2019,21 @@ balance_tasks(struct rq *this_rq, int th
pulled++;
rem_load_move -= p->se.load.weight;
-#ifdef CONFIG_PREEMPT
/*
+ * pull_task() may have set_tsk_need_resched(). Clear it
+ * lest a sleeper awaken and be inappropriately preempted
+ * shortly thereafter.
+ *
* NEWIDLE balancing is a source of latency, so preemptible
* kernels will stop after the first task is pulled to minimize
* the critical ...The patch definitely changes the behavior, but it doesn't seem to solve (all
of) the root cause(s). The failsafe kicks in and clears the flag the next
time that update_rq_clock() is called, but there can still be a significant
delay between setting and clearing the flag. Right after boot, I'm now seeing
If I let it run for a while with a couple 'find /' and 'top' processes in the
It makes sense for the maximum skip time to converge to 10ms since
HZ=100 on the ARM test system, so after at most 10ms scheduler_tick() should
kick in and call update_rq_clock().
I'm also seeing similar delays on an 8-core x86_64 system with HZ=1000. With
a workload of 'make clean && make -j8 bzImage' and 'top' I get stable delays
after boot in the range of 1ms and a max delay of ~53ms during boot (the
field skip_clock_recent_max resets to zero every 10000 samples;
The patch below applies on top of yours and should reproduce the behavior.
Please let me know if there is something else that I should test.
Thanks,
Bjoern
---
sched: show length of runqueue clock deactivation in /proc/sched_debug
The runqueue clock update should obviously not be skipped for
prolonged times (otherwise the consumed time is not correctly kept
track of). This patch measures the time between setting and clearing
the rq->skip_clock_update flag. The maximum observed value is exported
in /proc/sched_debug.
---
kernel/sched.c | 21 ++++++++++++++++++++-
kernel/sched_debug.c | 4 ++++
2 files changed, 24 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 996ed5c..21c1525 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -541,6 +541,11 @@ struct rq {
unsigned long long rq_cpu_time;
/* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */
+ u64 skip_clock_max;
+ u64 skip_clock_recent_max;
+ u64 skip_clock_set;
+ unsigned int skip_clock_count;
+
/* sys_sched_yield() stats */
unsigned int yld_count;
@@ -653,6 +658,17 @@ inline void ...A pull isn't the only vulnerability. Since idle_balance() drops
Sched: clear_tsk_need_resched() after NEWIDLE balancing
idle_balance() drops/retakes rq->lock, leaving the previous task
vulnerable to set_tsk_need_resched() from another CPU. Clear it
after NEWIDLE balancing to maintain the invariant that descheduled
tasks are NOT marked for resched.
This also confuses the skip_clock_update logic, which assumes that
the next call to update_rq_clock() will come nearly ĩmmediately after
being set. Make the optimization more robust by clearing before we
balance and in update_rq_clock().
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>
---
kernel/sched.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -657,6 +657,7 @@ inline void update_rq_clock(struct rq *r
sched_irq_time_avg_update(rq, irq_time);
}
+ rq->skip_clock_update = 0;
}
/*
@@ -3714,7 +3715,6 @@ static void put_prev_task(struct rq *rq,
{
if (prev->se.on_rq)
update_rq_clock(rq);
- rq->skip_clock_update = 0;
prev->sched_class->put_prev_task(rq, prev);
}
@@ -3799,8 +3799,16 @@ need_resched_nonpreemptible:
pre_schedule(rq, prev);
- if (unlikely(!rq->nr_running))
+ if (unlikely(!rq->nr_running)) {
idle_balance(cpu, rq);
+ /*
+ * idle_balance() releases/retakes rq->lock, leaving prev
+ * vulnerable to set_tsk_need_resched() from another cpu.
+ * Clock updates should not be skipped while we're away.
+ */
+ clear_tsk_need_resched(prev);
+ rq->skip_clock_update = 0;
+ }
put_prev_task(rq, prev);
next = pick_next_task(rq);
--
Unfortunately that doesn't seem to do it yet.
After running five 'find /' instances to completion on the ARM platform,
I'm still seeing delays close to 10ms.
bbb@district10:~$ egrep 'cpu#|skip' /proc/sched_debug
cpu#0
.skip_clock_count : 89606
.skip_clock_recent_max : 9817250
.skip_clock_max : 21992375
cpu#1
.skip_clock_count : 81978
.skip_clock_recent_max : 9582500
.skip_clock_max : 17201750
cpu#2
.skip_clock_count : 74565
.skip_clock_recent_max : 9678000
.skip_clock_max : 9879250
cpu#3
.skip_clock_count : 81685
.skip_clock_recent_max : 9300125
.skip_clock_max : 14115750
On the x86_64 host, I've changed to HZ=100 and am now also seeing delays
close to 10ms after 'make clean && make -j8 bzImage'.
bbb@koruna:~$ egrep 'cpu#|skip' /proc/sched_debug
cpu#0, 2493.476 MHz
.skip_clock_count : 29703
.skip_clock_recent_max : 9999858
.skip_clock_max : 40645942
cpu#1, 2493.476 MHz
.skip_clock_count : 32696
.skip_clock_recent_max : 9959118
.skip_clock_max : 35074771
cpu#2, 2493.476 MHz
.skip_clock_count : 31742
.skip_clock_recent_max : 9788654
.skip_clock_max : 24821765
cpu#3, 2493.476 MHz
.skip_clock_count : 31123
.skip_clock_recent_max : 9858546
.skip_clock_max : 44276033
cpu#4, 2493.476 MHz
.skip_clock_count : 28346
.skip_clock_recent_max : 10000775
.skip_clock_max : 18681753
cpu#5, 2493.476 MHz
.skip_clock_count : 29421
.skip_clock_recent_max : 9997656
.skip_clock_max : 138473407
...Seems the checking for <if (prev->se.on_rq)> in put_prev_task()
is the culprit.
Because if we preempt a going sleep process on another CPU,
we will fail to update the rq clock for that CPU in schedule.
For example:
CPUA CPUB
process xxx == current
check_preempt_curr() for CPUB ... skip_clock_update==1
going to sleep
->schedule()
->deactivate_task() fail to update rq clock
because skip_clock_update==1
->put_prev_task() fail to update rq clock
because prev->se.on_rq==false
Then rq clock on CPUB will updated until another schedule envent
comes.
So Bjoern, is deleting the checking for prev->se.on_rq in
put_prev_task() helpful?
Thanks,
Yong
--
My test show there indeed is some improvement. But I just notice skip_clock_recent_max/max is based on _nanosecond_, so the 10ms delay mentioned by Bjoern should be _10us_. --
Because skip_clock_recent_max/skip_clock_max is based on nanosecond, --
Ohhhh. My head is twisted. Sorry for the noise. I should go to sleep now to make me clear. ;) Thanks, --
Please just ignore my previous email and sorry for those noise. I think I find the root cause after all-night sleep. :) when we init idle task, we doesn't mark it on_rq. My test show the concern is smoothed by below patch. Thanks, Yong --- diff --git a/kernel/sched.c b/kernel/sched.c index dc91a4d..21c76d9 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -5486,6 +5486,7 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu) __sched_fork(idle); idle->state = TASK_RUNNING; idle->se.exec_start = sched_clock(); + idle->se.on_rq = 1; cpumask_copy(&idle->cpus_allowed, cpumask_of(cpu)); /* --
Close :) The skip_clock_update flag should only be set if rq->curr is on_rq, because it it _that_ clock update during dequeue, and subsequent microscopic vruntime update it causes that we're trying to avoid. I think the below fixes it up properly. Sched: fix skip_clock_update optimization idle_balance() drops/retakes rq->lock, leaving the previous task vulnerable to set_tsk_need_resched(). Clear it after we return from balancing instead, and in setup_thread_stack() as well, so no successfully descheduled or never scheduled task has it set. Need resched confused the skip_clock_update logic, which assumes that the next call to update_rq_clock() will come nearly immediately after being set. Make the optimization robust against the waking a sleeper before it sucessfully deschedules case by checking that the current task has not been dequeued before setting the flag, since it is that useless clock update we're trying to save, and clear in update_rq_clock() to ensure that ONE call may be skipped. Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Bjoern B. Brandenburg <bbb.lst@gmail.com> Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com> --- kernel/fork.c | 1 + kernel/sched.c | 6 +++--- 2 files changed, 4 insertions(+), 3 deletions(-) Index: linux-2.6.37.git/kernel/sched.c =================================================================== --- linux-2.6.37.git.orig/kernel/sched.c +++ linux-2.6.37.git/kernel/sched.c @@ -660,6 +660,7 @@ inline void update_rq_clock(struct rq *r sched_irq_time_avg_update(rq, irq_time); } + rq->skip_clock_update = 0; } /* @@ -2138,7 +2139,7 @@ static void check_preempt_curr(struct rq * A queue event has occurred, and we're going to schedule. In * this case, we can save a useless back to back clock update. */ - if (test_tsk_need_resched(rq->curr)) + if (rq->curr->se.on_rq && test_tsk_need_resched(rq->curr)) ...
Yep. Now it's running well on my machine. If you want, you can add my tested-by. :) Thanks, --
Done.
Sched: fix skip_clock_update optimization
idle_balance() drops/retakes rq->lock, leaving the previous task
vulnerable to set_tsk_need_resched(). Clear it after we return
from balancing instead, and in setup_thread_stack() as well, so
no successfully descheduled or never scheduled task has it set.
Need resched confused the skip_clock_update logic, which assumes
that the next call to update_rq_clock() will come nearly immediately
after being set. Make the optimization robust against the waking
a sleeper before it sucessfully deschedules case by checking that
the current task has not been dequeued before setting the flag,
since it is that useless clock update we're trying to save, and
clear in update_rq_clock() to ensure that ONE call may be skipped.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Tested-by: Yong Zhang <yong.zhang0@gmail.com>
---
kernel/fork.c | 1 +
kernel/sched.c | 6 +++---
2 files changed, 4 insertions(+), 3 deletions(-)
Index: linux-2.6.37.git/kernel/sched.c
===================================================================
--- linux-2.6.37.git.orig/kernel/sched.c
+++ linux-2.6.37.git/kernel/sched.c
@@ -660,6 +660,7 @@ inline void update_rq_clock(struct rq *r
sched_irq_time_avg_update(rq, irq_time);
}
+ rq->skip_clock_update = 0;
}
/*
@@ -2138,7 +2139,7 @@ static void check_preempt_curr(struct rq
* A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back clock update.
*/
- if (test_tsk_need_resched(rq->curr))
+ if (rq->curr->se.on_rq && test_tsk_need_resched(rq->curr))
rq->skip_clock_update = 1;
}
@@ -3854,7 +3855,6 @@ static void put_prev_task(struct rq *rq,
{
if (prev->se.on_rq)
update_rq_clock(rq);
- rq->skip_clock_update = 0;
...Shouldn't we do that at the end of schedule()? Since the purpose of ->skip_clock_update is to avoid multiple calls to: - avoid overhead - ensure scheduling is accounted at a single point [ for that latter purpose it might also make sense to put that point somewhere around context_switch() but due to the fact that we need a clock update early that's a bit impractical. ] OK, I initially tried to replace the test with a return value of ->check_preempt_curr() and such, but that turns into a lot of code and Good find, this needs to be done after the idle balancing because that can release the rq->lock and allow for TIF_NEED_RESCHED to be set again. Maybe complement this with a WARN_ON_ONCE(test_tsk_need_resched(next)) somewhere after pick_next_task() so as to ensure that !current has ! OK.. have we looked if there's more TIF flags that could do with a reset? --
Yeah, could do that instead. There's no gain in any call that may happen in the interval between. Think I'll measure though, this bug was mmm, no. -Mike --
Crawling over flags had to be left for another day.
Sched: fix skip_clock_update optimization
idle_balance() drops/retakes rq->lock, leaving the previous task
vulnerable to set_tsk_need_resched(). Clear it after we return
from balancing instead, and in setup_thread_stack() as well, so
no successfully descheduled or never scheduled task has it set.
Need resched confused the skip_clock_update logic, which assumes
that the next call to update_rq_clock() will come nearly immediately
after being set. Make the optimization robust against the waking
a sleeper before it sucessfully deschedules case by checking that
the current task has not been dequeued before setting the flag,
since it is that useless clock update we're trying to save, and
clear unconditionally in schedule() proper instead of conditionally
in put_prev_task().
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Tested-by: Yong Zhang <yong.zhang0@gmail.com>
---
kernel/fork.c | 1 +
kernel/sched.c | 28 +++++++++++++++-------------
2 files changed, 16 insertions(+), 13 deletions(-)
Index: linux-2.6.37.git/kernel/sched.c
===================================================================
--- linux-2.6.37.git.orig/kernel/sched.c
+++ linux-2.6.37.git/kernel/sched.c
@@ -649,17 +649,18 @@ static void sched_irq_time_avg_update(st
inline void update_rq_clock(struct rq *rq)
{
- if (!rq->skip_clock_update) {
- int cpu = cpu_of(rq);
- u64 irq_time;
-
- rq->clock = sched_clock_cpu(cpu);
- irq_time = irq_time_cpu(cpu);
- if (rq->clock - irq_time > rq->clock_task)
- rq->clock_task = rq->clock - irq_time;
+ int cpu = cpu_of(rq);
+ u64 irq_time;
- sched_irq_time_avg_update(rq, irq_time);
- }
+ if (rq->skip_clock_update)
+ return;
+
+ rq->clock = sched_clock_cpu(cpu);
+ irq_time = irq_time_cpu(cpu);
+ if ...Sure, just something we need to keep on the todo list (as if that thing isn't bottomless ;-) Anyway, Thanks I've queued it up (added stable@kernel.org as well) and will hopefully still get it in .37. --
Commit-ID: f26f9aff6aaf67e9a430d16c266f91b13a5bff64 Gitweb: http://git.kernel.org/tip/f26f9aff6aaf67e9a430d16c266f91b13a5bff64 Author: Mike Galbraith <efault@gmx.de> AuthorDate: Wed, 8 Dec 2010 11:05:42 +0100 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Wed, 8 Dec 2010 20:15:06 +0100 Sched: fix skip_clock_update optimization idle_balance() drops/retakes rq->lock, leaving the previous task vulnerable to set_tsk_need_resched(). Clear it after we return from balancing instead, and in setup_thread_stack() as well, so no successfully descheduled or never scheduled task has it set. Need resched confused the skip_clock_update logic, which assumes that the next call to update_rq_clock() will come nearly immediately after being set. Make the optimization robust against the waking a sleeper before it sucessfully deschedules case by checking that the current task has not been dequeued before setting the flag, since it is that useless clock update we're trying to save, and clear unconditionally in schedule() proper instead of conditionally in put_prev_task(). Signed-off-by: Mike Galbraith <efault@gmx.de> Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com> Tested-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org LKML-Reference: <1291802742.1417.9.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/fork.c | 1 + kernel/sched.c | 26 ++++++++++++++------------ 2 files changed, 15 insertions(+), 12 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 3b159c5..5447dc7 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -273,6 +273,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig) setup_thread_stack(tsk, orig); clear_user_return_notifier(tsk); + clear_tsk_need_resched(tsk); stackend = end_of_stack(tsk); *stackend = STACK_END_MAGIC; /* for overflow detection */ diff --git a/kernel/sched.c b/kernel/sched.c index ...
I just managed to trigger this one.. will try and reproduce to find out wth happened. --
I don't see it on my side. Then just show some info(don't know if it's helpful): wake_up_idle_cpu() can set TIF_RESCHED locklessly. So does WARN_ON_ONCE(next != rq->idle && test_tsk_need_resched(next)); help? Thanks, Yong -- Only stand for myself. --
Its a good possibility, sadly I haven't been able to reproduce so I cannot confirm. I'll keep my machine running, maybe it will trigger in a few hours. --
Mike, Yong,
thanks for looking into this. On the x86_64 host, I'm now see seeing
delays of less than 0.1ms, which seems fine. On the ARM host, I'm seeing
delays of up to ~2.5ms. However, this only happens on CPU0, which makes me
think that it is probably just a byproduct of interrupts.
bbb@district10:~$ egrep 'cpu#|skip' /proc/sched_debug
cpu#0
.skip_clock_count : 98494
.skip_clock_recent_max : 2543000
.skip_clock_max : 9876875
cpu#1
.skip_clock_count : 61084
.skip_clock_recent_max : 861750
.skip_clock_max : 1382875
cpu#2
.skip_clock_count : 60846
.skip_clock_recent_max : 193375
.skip_clock_max : 2236500
cpu#3
.skip_clock_count : 60270
.skip_clock_recent_max : 318750
.skip_clock_max : 1679000
bbb@district10:~$ cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
33: 29 0 0 0 GIC timer
35: 49 0 0 0 GIC isp1760-hcd:usb1
36: 145 0 0 0 GIC uart-pl011
39: 10 0 0 0 GIC kmi-pl050
40: 116 0 0 0 GIC kmi-pl050
41: 215789 0 0 0 GIC eth0
46: 0 0 0 0 GIC mmci-pl18x (cmd)
47: 0 0 0 0 GIC mmci-pl18x (pio)
IPI: 102721 103079 94532 111227
LOC: 115129 114840 114166 115122
Err: 0
So the latest patch seems to solve the issue.
Tested-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Btw, I think this patch is a good candidate for the .36-stable tree since
it fixes a major regression.
Thanks,
Bjoern
--
Hi Bjoern, sorry the the slow response, and sadly I'll have to ask a little bit more patience yet :/ It a nice find and I'll look into it as soon as possible. --
