Re: [git] CFS-devel, group scheduler, fixes

Previous thread: [PATCH 1/1] unify DMA_..BIT_MASK definitions: v3 by Borislav Petkov on Tuesday, September 18, 2007 - 3:27 pm. (2 messages)

Next thread: [git] CFS-devel, group scheduler, fixes by Dmitry Adamushko on Tuesday, September 18, 2007 - 3:46 pm. (2 messages)
To: <linux-kernel@...>
Cc: Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Tuesday, September 18, 2007 - 3:36 pm

[ well, don't expect to find here anything like RDCFS
(no, 'D' does not stand for 'dumb'!). I was focused
on more prosaic things in the mean time so just
didn't have time for writing it.. ]

here is a few cleanup/simplification/optimization(s)
based on the recent modifications in the sched-dev tree.

(1) optimize task_new_fair()
(2) simplify yield_task()
(3) rework enqueue/dequeue_entity() to get rid of
sched_class::set_curr_task()

additionally, the changes somewhat decrease code size:

text data bss dec hex filename
43538 5398 48 48984 bf58 build/kernel/sched.o.before
43250 5390 48 48688 be30 build/kernel/sched.o

(SMP + lots of debugging options but, I guess, in this case the diff
should remain visible for any combination).

---

(1)

due to the fact that we no longer keep the 'current' within the tree,
dequeue/enqueue_entity() is useless for the 'current' in task_new_fair().
We are about to reschedule and sched_class->put_prev_task() will put
the 'current' back into the tree, based on its new key.

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>

---
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 6e52d5a..5a244e2 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -969,10 +969,11 @@ static void task_new_fair(struct rq *rq, struct task_struct *p)

if (sysctl_sched_child_runs_first &&
curr->vruntime < se->vruntime) {
-
- dequeue_entity(cfs_rq, curr, 0);
+ /*
+ * Upon rescheduling, sched_class::put_prev_task() will place
+ * 'current' within the tree based on its new key value.
+ */
swap(curr->vruntime, se->vruntime);
- enqueue_entity(cfs_rq, curr, 0);
}

update_stats_enqueue(cfs_rq, se);

---

Dmitry

-

To: dimm <dmitry.adamushko@...>
Cc: <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Tuesday, September 18, 2007 - 4:16 pm

thanks, applied.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>, Mike Galbraith <efault@...>
Date: Wednesday, September 19, 2007 - 2:03 am

This patch attempts to improve CFS's SMP global fairness based on the new
virtual time design.

Removed vruntime adjustment in set_task_cpu() as it skews global fairness.

Modified small_imbalance logic in find_busiest_group(). If there's small
imbalance, move tasks from busiest to local sched_group only if the local
group contains a CPU whose min_vruntime is the maximum among all CPUs in
the same sched_domain. This prevents any CPU from advancing too far ahead
in virtual time and avoids tasks thrashing between two CPUs without
utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs,
since the load is not evenly divisible by the number of CPUs, we want the
extra load to have a fair use of every CPU in the system.

Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task
runs a trivial while (1) loop. The benchmark runs for 300 seconds and, at
every T seconds, it samples for each task the following:

1. Actual CPU time the task received during the past 60 seconds.

2. Ideal CPU time it would receive under a perfect fair scheduler.

3. Lag = ideal time - actual time, where a positive lag means the task
received less CPU time than its fair share and negative means it received
more.

4. Error = lag / ideal time

The following shows the max and min errors among all samples for all tasks
before and after applying the patch:

Before:

Sampling interval: 30 s
Max error: 100.00%
Min error: -25.00%

Sampling interval: 10 s
Max error: 27.62%
Min error: -25.00%

After:

Sampling interval: 30 s
Max error: 1.33%
Min error: -1.29%

Sampling interval: 10 s
Max error: 7.38%
Min error: -6.25%

The errors for the 10s sampling interval are still not as small as I had
hoped for, but looks like it does have some improvement.

tong

Signed-off-by: Tong Li <tong.n.li@intel.com>
---
--- linux-2.6-sched-devel-orig/kernel/sched.c 2007-09-15 22:00:48.000000000 -0700
+++ linux-2.6-sched-devel/kernel/sched.c 2007-09-18 22:10:52....

To: Tong Li <tong.n.li@...>
Cc: Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>, Mike Galbraith <efault@...>
Date: Wednesday, September 19, 2007 - 3:35 pm

Just as an experiment, can you run 82 tasks on 8 CPUs. Current
imbalance_pct logic will not detect and fix the global fairness issue

This patch removes quite a few lines and all this is logic is not for
fairness :( Especially the above portion handles some of the HT/MC

Not sure how this all interacts when some of the cpu's are idle. I have to
look more closely.

thanks,
suresh
-

To: Siddha, Suresh B <suresh.b.siddha@...>
Cc: Tong Li <tong.n.li@...>, Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>, Mike Galbraith <efault@...>
Date: Wednesday, September 19, 2007 - 4:58 pm

Right. For 82 tasks on 8 CPUs, the max error is 7.07% and min is -14.12%.
It could be fixed by removing the imbalance_pct check (i.e., making
imbalance_pct effectively 1). The cost would be more migrations, so I

What are the optimizations? I was trying to simplify the code to use only
vruntime to control balancing when there's small imbalance, but if it
breaks non-fairness related optimizations, we can certainly add the code
back.

tong
-

To: Tong Li <tong.n.li@...>
Cc: Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, September 19, 2007 - 2:28 am

Greetings,

Since I'm (still) encountering Xorg latency issues (which go away if
load is hefty instead of light) even with that migration adjustment and
synchronization, and am having difficulty nailing it down to a specific
event, I'll test this immediately.

-Mike

-

To: Tong Li <tong.n.li@...>
Cc: Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, September 19, 2007 - 3:51 am

(had to apply manually to freshly pulled tree)

Drat. This didn't cure the latency hits with a Xorg at nice -5 running
with a make -j2 at nice 0, but seems to have reinstated a latency issue
which was previously cured.

Xorg 1 sec. max latency samples:
(trimmed to only show >20ms latencies)
se.wait_max : 23343582
se.wait_max : 20119460
se.wait_max : 20771573
se.wait_max : 21084567
se.wait_max : 31338500
se.wait_max : 35368148
se.wait_max : 39199642
se.wait_max : 22889062
se.wait_max : 40285501
se.wait_max : 21002720
se.wait_max : 21002266
se.wait_max : 21680578
se.wait_max : 22012913
se.wait_max : 94646331
se.wait_max : 29003693
se.wait_max : 20812613

(boot with maxcpus=1 or nail X+make to one cpu and these latencies are
gone, so it does seem to be the migration logic - why i was so
interested in testing your patch)

The scenario which was previously cured was this:
taskset -c 1 nice -n 0 ./massive_intr 2 9999
taskset -c 1 nice -n 5 ./massive_intr 2 9999
click link
(http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
up browser and OpenOffice Impress.

Xorg (at nice -5 + above scenario) latency samples:
se.wait_max : 57985337
se.wait_max : 25163510
se.wait_max : 37005538
se.wait_max : 66986511
se.wait_max : 53990868
se.wait_max : 80976761
se.wait_max : 96967501
se.wait_max : 80989254
se.wait_max : 53990897
se.wait_max : 181963905
...

To: Tong Li <tong.n.li@...>
Cc: Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, September 19, 2007 - 4:42 am

To be doubly sure of the effect on the pinned tasks + migrating Xorg
scenario, I just ran the above test 10 times with virgin devel source.
Maximum Xorg latency was 20ms.

-Mike

-

To: Mike Galbraith <efault@...>
Cc: Tong Li <tong.n.li@...>, Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, September 19, 2007 - 1:06 pm

Yeah, the patch was a first attempt at getting better global fairness for
unpinned tasks. I expected there'd be latency problems when unpinned and
pinned tasks co-exist. The original code for vruntime adjustment in
set_task_cpu() helped alleviate this problem, so removing it in my patch
would definitely bring the problem back. On the other hand, leaving it
there broke global fairness in my fairness benchmark. So we'd need a
better compromise.

Were the experiments run on a 2-CPU system? When Xorg experiences large
wait time, is it on the same CPU that has the two pinned tasks? If this is
the case, then the problem could be X somehow advanced faster and got a
larger vruntime then the two pinned tasks, so it had to wait for the
pinned ones to catch up before it got a chance to be scheduled.

tong
-

To: Tong Li <tong.n.li@...>
Cc: Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Thursday, September 20, 2007 - 12:55 am

Good question. I've not yet been able to capture any event where
vruntimes are that far apart in sched_debug.

-Mike

-

To: Tong Li <tong.n.li@...>
Cc: Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Thursday, September 20, 2007 - 3:15 am

But, I did just manage to trigger some horrid behavior, and log it. I
modified the kernel to print task's actual tree key instead of their
current vruntime, and was watching that while make -j2 was running (and
not seeing anything very interesting), when on a lark, I restarted
SuSE's system updater thingy. That beast chews 100% CPU for so long at
startup that I long ago got annoyed, and changed it to run at nice 19.
Anyway, when it started, interactivity went to hell in the proverbial
hand-basket, and the sched_debug log shows some interesting results..
like spread0 hitting -13659412644, and cc1 being keyed at -3867063305.

cpu#0, 2992.608 MHz
.nr_running : 4
.load : 4096
.nr_switches : 1105882
.nr_load_updates : 735146
.nr_uninterruptible : 4294966399
.jiffies : 1936201
.next_balance : 1936276
.curr->pid : 27004
.clock : 735034802877
.idle_clock : 0
.prev_clock_raw : 2259213083886
.clock_warps : 0
.clock_overflows : 677631
.clock_deep_idle_events : 0
.clock_max_delta : 999848
.cpu_load[0] : 4096
.cpu_load[1] : 3960
.cpu_load[2] : 3814
.cpu_load[3] : 3539
.cpu_load[4] : 3153

cfs_rq
.exec_clock : 623175870707
.MIN_vruntime : 3791173262297
.min_vruntime : 3791173259723
.max_vruntime : 3791173264579
.spread : 2282
.spread0 : 0
.nr_sync_min_vruntime : 296756

runnable tasks:
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
-----------------------------------------------------------------------------...

To: Mike Galbraith <efault@...>
Cc: Tong Li <tong.n.li@...>, Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Thursday, September 20, 2007 - 3:48 pm

I don't know if this is relevant, but 4294966399 in nr_uninterruptible
for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
I don't know if this rings a bell for someone or if it's a completely
useless comment, but just in case...

HTH,
Willy

-

To: Willy Tarreau <w@...>
Cc: Tong Li <tong.n.li@...>, Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Thursday, September 20, 2007 - 10:40 pm

A task can block on one cpu, and wake up on another, which isn't
tracked, hence the fishy looking numbers. The true nr_uninterruptible
is the sum of all.

-Mike

-

To: Mike Galbraith <efault@...>
Cc: Tong Li <tong.n.li@...>, Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Thursday, September 20, 2007 - 11:11 pm

OK, thanks Mike for this clarification.

Willy

-

To: Mike Galbraith <efault@...>
Cc: Tong Li <tong.n.li@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Thursday, September 20, 2007 - 3:51 am

btw., that looks like a debug printout bug in sched-devel.git - could
you send me your fix? I've pushed out the latest sched-devel (ontop of
-rc7) to the usual place:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

(note that there are some debug printout updates which might clash with
your fix)

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Tong Li <tong.n.li@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Thursday, September 20, 2007 - 4:11 am

Well, I temporarily added a key field, which is only used to store the

I'll pull/build this and test my migration problem there. All I have to
do is to add a nice 19 chew-max to my make -j2, and all hell breaks
loose. Always suspecting myself first in the search for dirty rotten
SOB who broke local scheduler :) I nuked the migration fix. Looks like

-Mike

-

To: Mike Galbraith <efault@...>
Cc: Ingo Molnar <mingo@...>, Tong Li <tong.n.li@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Friday, September 21, 2007 - 11:27 pm

Mike,

Could you try this patch to see if it solves the latency problem?

tong

Changes:

1. Modified vruntime adjustment logic in set_task_cpu(). See comments in
code. This fixed the negative vruntime problem.

2. This code in update_curr() seems to be wrong:

if (unlikely(!curr))
return sync_vruntime(cfs_rq);

cfs_rq->curr can be NULL even if cfs_rq->nr_running is non-zero (e.g.,
when an RT task is running). We only want to call sync_vruntime when
cfs_rq->nr_running is 0. This fixed the large latency problem (at least in
my tests).

Signed-off-by: Tong Li <tong.n.li@intel.com>
---
diff -uprN linux-2.6-sched-devel-orig/kernel/sched.c linux-2.6-sched-devel/kernel/sched.c
--- linux-2.6-sched-devel-orig/kernel/sched.c 2007-09-20 12:15:41.000000000 -0700
+++ linux-2.6-sched-devel/kernel/sched.c 2007-09-21 19:40:08.000000000 -0700
@@ -1033,9 +1033,20 @@ void set_task_cpu(struct task_struct *p,
if (p->se.block_start)
p->se.block_start -= clock_offset;
#endif
- if (likely(new_rq->cfs.min_vruntime))
- p->se.vruntime -= old_rq->cfs.min_vruntime -
- new_rq->cfs.min_vruntime;
+ /*
+ * Reset p's vruntime if it moves to new_cpu whose min_vruntime is
+ * 100,000,000 (equivalent to 100ms for nice-0 tasks) larger or
+ * smaller than p's vruntime. This improves interactivity when
+ * pinned and unpinned tasks co-exist. For example, pinning a few
+ * tasks to a CPU can cause its min_vruntime much smaller than the
+ * other CPUs. If a task moves to this CPU, its vruntime can be so
+ * large it won't be scheduled until the locally pinned tasks'
+ * vruntimes catch up, causing large delays.
+ */
+ if (unlikely(old_cpu != new_cpu && p->se.vruntime &&
+ (p->se.vruntime > new_rq->cfs.min_vruntime + 100000000 ||
+ p->se.vruntime + 100000000 < new_rq->cfs.min_vruntime)))
+ p->se.vruntime = new_rq->cfs.min_vruntime;

__set_task_cpu(p, new_cpu);
}
@@ -1599,6 +1610,7 @@ s...

To: Tong Li <tong.n.li@...>
Cc: Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Saturday, September 22, 2007 - 6:01 am

No, but it helps some when running two un-pinned busy loops, one at nice
0, and the other at nice 19. Yesterday I hit latencies of up to 1.2
_seconds_ doing this, and logging sched_debug and /proc/`pidof
Xorg`/sched from SCHED_RR shells.

se.wait_max : 164.242748
se.wait_max : 121.996128
se.wait_max : 194.464773
se.wait_max : 517.425411
se.wait_max : 131.453214
se.wait_max : 122.984190
se.wait_max : 111.729274
se.wait_max : 119.567580
se.wait_max : 126.980696
se.wait_max : 177.241452
se.wait_max : 129.604329
se.wait_max : 119.631657

It doesn't help at all with this oddity while running same:

root@Homer: now is
the time
foooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooor aall good men to come to the aid of their country

That was a nice 0 shell window. I'm not a great typist, but I ain't
_that_ bad :)

-Mike

-

To: Tong Li <tong.n.li@...>
Cc: Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Sunday, September 23, 2007 - 3:14 am

Looking at a log (snippet attached) from this morning with the last hunk
of your patch still intact, it looks like min_vruntime is being modified
after a task is queued. If you look at the snippet, you'll see the nice
19 bash busy loop on CPU1 with a vruntime of 3010385.345325, and one
second later on CPU1 with it's vruntime at 2814952.425082, but
min_vruntime is 3061874.838356.

I took a hammer to it, and my latencies running this test went away.

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c 2007-09-22 13:37:32.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c 2007-09-23 08:29:38.000000000 +0200
@@ -290,14 +290,19 @@ __update_curr(struct cfs_rq *cfs_rq, str
static void sync_vruntime(struct cfs_rq *cfs_rq)
{
struct rq *rq;
- int cpu;
+ int cpu, wrote = 0;

for_each_online_cpu(cpu) {
rq = &per_cpu(runqueues, cpu);
+ if (spin_is_locked(&rq->lock))
+ continue;
+ smp_wmb();
cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
rq->cfs.min_vruntime);
+ wrote++;
}
- schedstat_inc(cfs_rq, nr_sync_min_vruntime);
+ if (wrote)
+ schedstat_inc(cfs_rq, nr_sync_min_vruntime);
}

static void update_curr(struct cfs_rq *cfs_rq)
@@ -306,8 +311,10 @@ static void update_curr(struct cfs_rq *c
u64 now = rq_of(cfs_rq)->clock;
unsigned long delta_exec;

- if (unlikely(!curr))
+ if (unlikely(!cfs_rq->nr_running))
return sync_vruntime(cfs_rq);
+ if (unlikely(!curr))
+ return;

/*
* Get the amount of time the current task was running

To: Mike Galbraith <efault@...>
Cc: Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Monday, September 24, 2007 - 2:21 am

I think this could be what was happening: between the two seconds, CPU 0
becomes idle and it pulls the nice 19 task over via pull_task(), which
calls set_task_cpu(), which changes the task's vruntime to the current
min_vruntime of CPU 0 (in my patch). Then, after set_task_cpu(), CPU 0
calls activate_task(), which calls enqueue_task() and in turn
update_curr(). Now, nr_running on CPU 0 is 0, so sync_vruntime() gets
called and CPU 0's min_vruntime gets synced to the system max. Thus, the
nice 19 task now has a vruntime less than CPU 0's min_vruntime. I think
this can be fixed by adding the following code in set_task_cpu() before we
adjust p->vruntime:

if (!new_rq->cfs.nr_running)

I think this rq->lock check works because it prevents the above scenario
(CPU 0 is in pull_task so it must hold the rq lock). But my concern is
that it may be too conservative, since sync_vruntime is called by
update_curr, which mostly gets called in enqueue_task() and
dequeue_task(), both of which are often invoked with the rq lock being
held. Thus, if we don't allow sync_vruntime when rq lock is held, the sync
will occur much less frequently and thus weaken cross-CPU fairness.

tong
-

To: Tong Li <tong.n.li@...>
Cc: Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Monday, September 24, 2007 - 6:10 am

Hmm. I can imagine Mondo-Boxen-R-Us folks getting upset with that.
Better would be like Ingo said, see if we can toss sync_vrintime(), and
I've been playing with that...

I found something this morning, and as usual, the darn thing turned out
to be dirt simple. With sync_vruntime() disabled, I found queues with
negative min_vruntime right from boot, and went hunting. Adding some
instrumentation to set_task_cpu() (annoying consequences), I found the
below:

Legend:
vrun: tasks's vruntime
old: old queue's min_vruntime
new: new queue's min_vruntime
result: what's gonna happen

[ 60.214508] kseriod vrun: 1427596999 old: 15070988657 new: 4065818654 res: -9577573004
[ 218.274661] konqueror vrun: 342076210254 old: 658982966338 new: 219203403598 res: -97703352486
[ 218.284657] init vrun: 411638725179 old: 659187735208 new: 219203492472 res: -28345517557
[...]

A task which hasn't run in long enough for queues to have digressed
further than it's vruntime is going to end up with a negative vruntime.
Looking at place_entity(), it looks like it's supposed to fix that up,
but due to the order of arguments passed to max_vrintime(), and the
unsigned comparison therein, it won't.

Running with the patchlet below, my box _so far_ has not become
terminally unhappy despite spread0. I'm writing this with the pinned
hogs test running right now, and all is well, so I _think_ it might be
ok to just remove sync_vruntime() after all.

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c 2007-09-23 14:48:18.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c 2007-09-24 11:02:05.000000000 +0200
@@ -117,7 +117,7 @@ static inline struct task_struct *task_o
static inline u64
max_vruntime(u64 min_vruntime, u64 vruntime)
{
- if ((vruntime > min_vruntime) ||
+ if (((s64)vruntime > (s64)min_vruntime) ||
(min_vruntime > (1ULL << 61) && vruntime < (1...

To: Mike Galbraith <efault@...>
Cc: Tong Li <tong.n.li@...>, Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Monday, September 24, 2007 - 6:24 am

how about something like:

s64 delta = (s64)(vruntime - min_vruntime);
if (delta > 0)
min_vruntime += delta;

That would rid us of most of the funny conditionals there.
-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Tong Li <tong.n.li@...>, Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Monday, September 24, 2007 - 6:42 am

That still left me with negative min_vruntimes. The pinned hogs didn't
lock my box up, but I quickly got the below, so hastily killed it.

se.wait_max : 7.846949
se.wait_max : 301.951601
se.wait_max : 7.071359

-Mike

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Tong Li <tong.n.li@...>, Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Monday, September 24, 2007 - 7:22 am

Shouldn't the max() in place_entity() be the max_vruntime() that my
lysdexia told me it was when I looked at it earlier? ;-)

-Mike

-

To: Mike Galbraith <efault@...>
Cc: Tong Li <tong.n.li@...>, Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Monday, September 24, 2007 - 7:51 am

probably, my tree doesn't have that max anymore so I'm not sure.
-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Mike Galbraith <efault@...>, Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Monday, September 24, 2007 - 12:43 pm

Last time I looked, I thought the max() in place_entity() was fine and the
problem seemed to be in set_task_cpu() that was causing the negative
vruntimes:

if (likely(new_rq->cfs.min_vruntime))
p->se.vruntime -= old_rq->cfs.min_vruntime -
new_rq->cfs.min_vruntime;

I think it's fine we get rid of sync_vruntime(). I need to think more
about how to make global fairness work based on this--it seems to be more
complicated than if we had sync_vruntime().

tong
-

To: Mike Galbraith <efault@...>
Cc: Tong Li <tong.n.li@...>, Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Monday, September 24, 2007 - 7:08 am

Odd, the idea (which I think is clear) is that min_vruntime can wrap
around the u64 spectrum. And by using min_vruntime as offset to base
the key around, we get a signed but limited range key-space. (because
we update min_vruntime to be the leftmost task (in a monotonic fashion))

So I'm having trouble with these patches, that is, both your wrap
around condition of:

if (likely(new_rq->cfs.min_vruntime))

as well as the last patchlet:

if (((s64)vruntime > (s64)min_vruntime) ||

in that neither of these changes make sense in what its trying to do.

Its perfectly valid for min_vruntime to exist in 1ULL << 63.

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Tong Li <tong.n.li@...>, Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Monday, September 24, 2007 - 7:43 am

But wrap backward timewarp is what's killing my box.

-Mike

-

To: Tong Li <tong.n.li@...>
Cc: Ingo Molnar <mingo@...>, dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Sunday, September 23, 2007 - 7:37 am

Hi,

I haven't chased down the exact scenario, but using a min_vruntime which
is about to change definitely seems to be what's causing my latency
woes. Does the below cure your fairness woes as well?

(first bit is just some debug format changes for your convenience if you
try it)

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_debug.c linux-2.6.23-rc7.d/kernel/sched_debug.c
--- git/linux-2.6.sched-devel/kernel/sched_debug.c 2007-09-22 13:37:32.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_debug.c 2007-09-21 12:12:27.000000000 +0200
@@ -67,7 +67,7 @@ print_task(struct seq_file *m, struct rq
(long long)(p->nvcsw + p->nivcsw),
p->prio);
#ifdef CONFIG_SCHEDSTATS
- SEQ_printf(m, "%15Ld.%06ld %15Ld.%06ld %15Ld.%06ld\n",
+ SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld\n",
SPLIT_NS(p->se.vruntime),
SPLIT_NS(p->se.sum_exec_runtime),
SPLIT_NS(p->se.sum_sleep_runtime));
@@ -83,10 +83,10 @@ static void print_rq(struct seq_file *m,

SEQ_printf(m,
"\nrunnable tasks:\n"
- " task PID tree-key switches prio"
- " exec-runtime sum-exec sum-sleep\n"
+ " task PID tree-key switches prio"
+ " exec-runtime sum-exec sum-sleep\n"
"------------------------------------------------------"
- "------------------------------------------------");
+ "----------------------------------------------------\n");

read_lock_irq(&tasklist_lock);

@@ -134,7 +134,7 @@ void print_cfs_rq(struct seq_file *m, in
spread0 = min_vruntime - rq0_min_vruntime;
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "spread0",
SPLIT_NS(spread0));
- SEQ_printf(m, " .%-30s: %ld\n", "spread0",
+ SEQ_printf(m, " .%-30s: %ld\n", "nr_sync_min_vruntime",
cfs_rq->nr_sync_min_vruntime);
}

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c 2007-09-22 13:37:32.00000...

To: dimm <dmitry.adamushko@...>
Cc: <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Tuesday, September 18, 2007 - 4:22 pm

the queue with your enhancements and simplifications applied looks good
here, and it booted fine on two testboxes. I've updated the
sched-devel.git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

(below is the full shortlog over current upstream.)

(I have not tested the group scheduling bits but perhaps Srivatsa would
like to do that?)

Ingo

------------------>
Dmitry Adamushko (9):
sched: clean up struct load_stat
sched: clean up schedstat block in dequeue_entity()
sched: sched_setscheduler() fix
sched: add set_curr_task() calls
sched: do not keep current in the tree and get rid of sched_entity::fair_key
sched: yield-workaround update
sched: optimize task_new_fair()
sched: simplify sched_class::yield_task()
sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()

Ingo Molnar (29):
sched: fix new-task method
sched: resched task in task_new_fair()
sched: small sched_debug cleanup
sched: debug: track maximum 'slice'
sched: uniform tunings
sched: use constants if !CONFIG_SCHED_DEBUG
sched: remove stat_gran
sched: remove precise CPU load
sched: remove precise CPU load calculations #2
sched: track cfs_rq->curr on !group-scheduling too
sched: cleanup: simplify cfs_rq_curr() methods
sched: uninline __enqueue_entity()/__dequeue_entity()
sched: speed up update_load_add/_sub()
sched: clean up calc_weighted()
sched: introduce se->vruntime
sched: move sched_feat() definitions
sched: optimize vruntime based scheduling
sched: simplify check_preempt() methods
sched: wakeup granularity fix
sched: add se->vruntime debugging
sched: sync up ->min_vruntime when going idle
sched: add more vruntime statistics
sched: debug: update exec_clock only when SCHED_DEBUG
sched: remove wait_runtime limit
sched: remove...

To: Ingo Molnar <mingo@...>
Cc: dimm <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>
Date: Tuesday, September 18, 2007 - 11:55 pm

Ingo,
I plan to test it today and send you any updates that may be
required.

--
Regards,
vatsa
-

Previous thread: [PATCH 1/1] unify DMA_..BIT_MASK definitions: v3 by Borislav Petkov on Tuesday, September 18, 2007 - 3:27 pm. (2 messages)

Next thread: [git] CFS-devel, group scheduler, fixes by Dmitry Adamushko on Tuesday, September 18, 2007 - 3:46 pm. (2 messages)