On Wed, 2010-07-21 at 17:10 +0530, Jack Daniel wrote:
What values get incorrect, do you observe vruntime funnies or only the
schedstat values?
Ah, so the problem is that task_fork_fair() does the task placement
without updated rq clocks? In which case I think we should at least do
an update_rq_clock(rq) in there (see the below patch).
sched_clock_stable being true implies the clock is stable across cores
and thus it shouldn't matter. Or are you saying you're seeing it being
set and still have issues?
The problem here is that your patch introduces the exact race I was
trying to close. We can only access rq->clock when also holding the
appropriate rq->lock, disabling IRQs, while also required is not
sufficient.
[ Also, that rmb() looks just plain wrong ]
Anyway, does the below cure your trouble? (Also, could you describe your
actually observed trouble in more detail?)
---
Subject: sched: ensure rq->clock get sync'ed when migrating tasks
sched_fork() -- we do task placement in ->task_fork_fair() ensure we
update_rq_clock() so we work with current time. We leave the vruntime
in relative state, so the time delay until wake_up_new_task() doesn't
matter.
wake_up_new_task() -- Since task_fork_fair() left p->vruntime in
relative state we can safely migrate, the activate_task() on the
remote rq will call update_rq_clock() and causes the clock to be
synced (enough).
try_to_wake_up() -- In case we'll migrate, we need to update the old
rq->clock, the activate_task() in ttwu_activate() will already update
the new rq->clock, and thus the clocks will get sync'ed.
load-balance -- Migrating running tasks always happens with both rq's
locked, either through double_rq_lock() or double_lock_balance(). So
sync the rq->clock there.
The problem seems to be that this all could result in too many rq->clock
updates, which are expensive.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched.c | 20 ++++++++++++++++++--
kernel/sched_fair.c | 2 ++
2 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index d3c0262..69584b4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1756,13 +1756,22 @@ static int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
*/
static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
+ int ret;
+
+#ifdef CONFIG_SCHED_DEBUG
if (unlikely(!irqs_disabled())) {
/* printk() doesn't work good under rq->lock */
raw_spin_unlock(&this_rq->lock);
BUG_ON(1);
}
+#endif
+
+ ret = _double_lock_balance(this_rq, busiest);
+
+ update_rq_clock(this_rq);
+ update_rq_clock(busiest);
- return _double_lock_balance(this_rq, busiest);
+ return ret;
}
static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
@@ -1782,7 +1791,9 @@ static void double_rq_lock(struct rq *rq1, struct rq *rq2)
__acquires(rq1->lock)
__acquires(rq2->lock)
{
+#ifdef CONFIG_SCHED_DEBUG
BUG_ON(!irqs_disabled());
+#endif
if (rq1 == rq2) {
raw_spin_lock(&rq1->lock);
__acquire(rq2->lock); /* Fake it out ;) */
@@ -1795,6 +1806,9 @@ static void double_rq_lock(struct rq *rq1, struct rq *rq2)
raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
}
}
+
+ update_rq_clock(rq1);
+ update_rq_clock(rq2);
}
/*
@@ -2395,8 +2409,10 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
}
cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
- if (cpu != orig_cpu)
+ if (cpu != orig_cpu) {
set_task_cpu(p, cpu);
+ update_rq_clock(rq);
+ }
__task_rq_unlock(rq);
rq = cpu_rq(cpu);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 9910e1b..f816e74 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3751,6 +3751,8 @@ static void task_fork_fair(struct task_struct *p)
raw_spin_lock_irqsave(&rq->lock, flags);
+ update_rq_clock(rq);
+
if (unlikely(task_cpu(p) != this_cpu))
__set_task_cpu(p, this_cpu);
--