Re: [RFC][PATCH 3/6] core changes in CFS

Previous thread: [PATCH] Fix broken ifdefs in usbtouchscreen by Ondrej Zary on Monday, June 11, 2007 - 8:09 am. (1 message)

Next thread: Re: [PATCH] move the kernel to 16MB for NUMA-Q by Andrew Morton on Monday, June 11, 2007 - 9:15 am. (15 messages)
From: Srivatsa Vaddagiri
Date: Monday, June 11, 2007 - 8:47 am

Ingo,
	Here's an update of the group fairness patch I have been working
on. Its against CFS v16 (sched-cfs-v2.6.22-rc4-mm2-v16.patch).

The core idea is to reuse much of CFS logic to apply fairness at higher
hierarchical levels (user, container etc). In this regard CFS engine has been
modified to deal with generic 'schedulable entities'. The patches
introduce two essential structures in CFS core:

	- struct sched_entity
		- represents a schedulable entity in a hierarchy. Task
		  is the lowest element in this hierarchy. Its ancestors
		  could be user, container etc. This structure stores
		  essential attributes/execution-history (wait_runtime etc) 
		  which is required by CFS engine to provide fairness between
		  'struct sched_entities' at the same hierarchy.

	- struct lrq
		- represents (per-cpu) runqueue in which ready-to-run 
		  'struct sched_entities' are queued. The fair clock
		  calculation is split to be per 'struct lrq'.

Here's a brief description of the patches to follow:

Patches 1-3 introduce the essential changes in CFS core to support this
concept. They rework existing code w/o any (intended!) change in functionality.

Patch 4 fixes some bad interaction between SCHED_RT and SCHED_NORMAL
tasks in current CFS.

Patch 5 introduces basic changes in CFS core to support group fairness.

Patch 6 hooks up scheduler with container patches in mm (as an interface
for task-grouping functionality).


Changes since last version:

	- Prelimnary SMP support included (based on the idea outlined at
	  http://lkml.org/lkml/2007/5/25/146)
	- Task grouping to which fairness is applied is based on Paul Menage's 
	  container patches included in -mm tree. Usage of this feature
	  is described in Patch 6/6
	- Fix some real time and SCHED_NORMAL interactions (maintain
	  separate nr_running/raw_weighted counters for SCHED_NORMAL
	  tasks)
	- Support arbitrary levels of hierarchy. Previous version
	  supported only 2 levels. Current version makes no assumption
	  on ...
From: Srivatsa Vaddagiri
Date: Monday, June 11, 2007 - 8:50 am

This patch introduces two new structures:

struct sched_entity
        stores essential attributes/execution-history used by CFS core
        to drive fairness between 'schedulable entities' (tasks, users etc)

struct lrq
        runqueue used to hold ready-to-run entities

These new structures are formed by grouping together existing fields in
existing structures (task_struct and rq) and hence represents rework
with zero functionality change.

Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>


---
 fs/proc/array.c           |    4 
 include/linux/sched.h     |   43 +++++----
 kernel/exit.c             |    2 
 kernel/posix-cpu-timers.c |   16 +--
 kernel/sched.c            |  186 +++++++++++++++++++++-------------------
 kernel/sched_debug.c      |   86 +++++++++---------
 kernel/sched_fair.c       |  211 +++++++++++++++++++++++-----------------------
 kernel/sched_rt.c         |   14 +--
 8 files changed, 289 insertions(+), 273 deletions(-)

Index: current/include/linux/sched.h
===================================================================
--- current.orig/include/linux/sched.h	2007-06-09 15:01:39.000000000 +0530
+++ current/include/linux/sched.h	2007-06-09 15:04:54.000000000 +0530
@@ -872,6 +872,29 @@
 	void (*task_new) (struct rq *rq, struct task_struct *p);
 };
 
+/* CFS stats for a schedulable entity (task, task-group etc) */
+struct sched_entity {
+	int load_weight;	/* for niceness load balancing purposes */
+	int on_rq;
+	struct rb_node run_node;
+	u64 wait_start_fair;
+	u64 wait_start;
+	u64 exec_start;
+	u64 sleep_start, sleep_start_fair;
+	u64 block_start;
+	u64 sleep_max;
+	u64 block_max;
+	u64 exec_max;
+	u64 wait_max;
+	u64 last_ran;
+
+	s64 wait_runtime;
+	u64 sum_exec_runtime;
+	s64 fair_key;
+	s64 sum_wait_runtime, sum_sleep_runtime;
+	unsigned long wait_runtime_overruns, wait_runtime_underruns;
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -886,33 ...
From: Linus Torvalds
Date: Monday, June 11, 2007 - 11:48 am

Wouldn't this be sensible to integrate into CFS _regardless_ of anything 
else?

Especially with the alleged "pluggable" scheduler model, one of the things 
lacking in any pluggability is the fact that the core thread data 
structures are anythign but pluggable. This patch makes it no more so, but 
at least it abstracts out the fields that are used by the scheduler, and 
would seem to be a nice cleanup. No?

		Linus
-

From: Ingo Molnar
Date: Monday, June 11, 2007 - 11:56 am

yeah, that's what i asked Srivatsa to do, and his currently patchset 
largely achieves that. I'm looking now into applying the core bits and 
checking that it's in essence a NOP when this isnt used. (that makes the 
later merging of group fairness and container scheduling a lot less 
contentious)

	Ingo
-

From: Balbir Singh
Date: Monday, June 11, 2007 - 7:15 pm

Shouldn't the rq->lock move into lrq?

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Srivatsa Vaddagiri
Date: Monday, June 11, 2007 - 8:52 pm

Right now, the per-cpu rq lock protects all (local) runqueues attached with the 
cpu. At some point, for scalability reasons, we may want to split that to
be per-cpu per-local runqueue (as you point out). I will put that in my todo
list of things to consider. Thanks for the review!

-- 
Regards,
vatsa
-

From: Srivatsa Vaddagiri
Date: Monday, June 11, 2007 - 8:52 am

We rely very much on task_cpu(p) to be correct at all times, so that we
can correctly find the runqueue from which the task has to be removed or
added to.

There is however one place in the scheduler where this assumption of
task_cpu(p) being correct is broken. This patch fixes that piece of
code.

(Thanks to Balbir Singh for pointing this out to me)

Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
 kernel/sched.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c	2007-06-09 15:07:17.000000000 +0530
+++ current/kernel/sched.c	2007-06-09 15:07:32.000000000 +0530
@@ -4624,7 +4624,7 @@
 static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 {
 	struct rq *rq_dest, *rq_src;
-	int ret = 0;
+	int ret = 0, on_rq;
 
 	if (unlikely(cpu_is_offline(dest_cpu)))
 		return ret;
@@ -4640,9 +4640,11 @@
 	if (!cpu_isset(dest_cpu, p->cpus_allowed))
 		goto out;
 
-	set_task_cpu(p, dest_cpu);
-	if (p->se.on_rq) {
+	on_rq = p->se.on_rq;
+	if (on_rq)
 		deactivate_task(rq_src, p, 0);
+	set_task_cpu(p, dest_cpu);
+	if (on_rq) {
 		activate_task(rq_dest, p, 0);
 		check_preempt_curr(rq_dest, p);
 	}
-- 
Regards,
vatsa
-

From: Balbir Singh
Date: Monday, June 11, 2007 - 7:17 pm

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Srivatsa Vaddagiri
Date: Monday, June 11, 2007 - 8:53 am

This patch introduces core changes in CFS work to operate on generic
schedulable entities. The task specific operations (like enqueue, dequeue, 
task_tick etc) is then rewritten to work off this generic CFS "library".

Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>


---
 kernel/sched_debug.c |    2 
 kernel/sched_fair.c  |  574 ++++++++++++++++++++++++++++++---------------------
 2 files changed, 345 insertions(+), 231 deletions(-)

Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c	2007-06-09 15:07:16.000000000 +0530
+++ current/kernel/sched_fair.c	2007-06-09 15:07:33.000000000 +0530
@@ -42,19 +42,54 @@
 
 extern struct sched_class fair_sched_class;
 
+/******************************************************************************/
+/*            BEGIN : CFS operations on generic schedulable entities          */
+/******************************************************************************/
+
+static inline struct rq *lrq_rq(struct lrq *lrq)
+{
+	return container_of(lrq, struct rq, lrq);
+}
+
+static inline struct sched_entity *lrq_curr(struct lrq *lrq)
+{
+	struct rq *rq = lrq_rq(lrq);
+	struct sched_entity *se = NULL;
+
+	if (rq->curr->sched_class == &fair_sched_class)
+		se = &rq->curr->se;
+
+	return se;
+}
+
+static long lrq_nr_running(struct lrq *lrq)
+{
+	struct rq *rq = lrq_rq(lrq);
+
+	return rq->nr_running;
+}
+
+#define entity_is_task(se)	1
+
+static inline struct task_struct *entity_to_task(struct sched_entity *se)
+{
+	return container_of(se, struct task_struct, se);
+}
+
+
 /**************************************************************/
 /* Scheduling class tree data structure manipulation methods:
  */
 
 /*
- * Enqueue a task into the rb-tree:
+ * Enqueue a entity into the rb-tree:
  */
-static inline void __enqueue_task_fair(struct rq *rq, struct task_struct *p)
+static inline void __enqueue_entity(struct lrq *lrq, struct ...
From: Balbir Singh
Date: Monday, June 11, 2007 - 7:29 pm

Could you add some comments as to what this means? Should be it boolean instead


p is a general convention for tasks in the code, could we use something


Can !curr ever be true? shoudn't we look into the sched_class of the task
that the entity belongs to?


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Srivatsa Vaddagiri
Date: Monday, June 11, 2007 - 9:22 pm

sure. Basically this macro tests whether a given schedulable entity is
task or not. Other possible schedulable entities could be process, user,
container etc. These various entities form a hierarchy with task being



'se' perhaps as is used elsewhere. I avoided making that change so that
people will see less diff o/p in the patch :) I agree though a better

I prefer its current name, but will consider your suggestion in next

Couple of cases that we need to consider here:

CONFIG_FAIR_GROUP_SCHED disabled:

	lrq_curr() essentially returns NULL if currently running task
	doesnt belong to fair_sched_class, else it returns &rq->curr->se
	So the check for fair_sched_class is taken care in that
	function.

CONFIG_FAIR_GROUP_SCHED enabled:

	lrq_curr() returns lrq->curr. I introduced ->curr field in lrq
	to optimize on not having to update lrq's fair_clock
	(update_curr upon enqueue/dequeue task) if it was not currently 
	"active".

	Lets say that there are two groups 'vatsa' and 'guest'
	with their own lrqs on each cpu. If CPU0 is currently running a
	task from group 'vatsa', then lrq_vatsa->curr will point to
	the currently running task, while lrq_guest->curr will be 
	NULL. While the task from 'vatsa' is running, if we were to
	enqueue/dequeue task from group 'guest', we need not 
	update lrq_guest's fair_clock (as it is not active currently).
	This optimization in update_curr is made possible by maintaining
	a 'curr' field in lrq.

Hope this answers your question.

-- 
Regards,
vatsa
-

From: Srivatsa Vaddagiri
Date: Monday, June 11, 2007 - 8:55 am

Currently nr_running and raw_weighted_load fields in runqueue affect
some CFS calculations (like distribute_fair_add, enqueue_sleeper etc).

These fields however are shared between tasks of all classes, which can
potentialy affect those calculations for SCHED_NORMAL tasks. However I
do not know of any bad behaviour caused by not splitting these fields (like
this patch does).

This split is neverthless needed for subsequent patches.

Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>


---
 kernel/sched.c      |  134 +++++++++++++++++++++++++---------------------------
 kernel/sched_fair.c |   65 ++++++++++++++++++++++++-
 2 files changed, 128 insertions(+), 71 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c	2007-06-09 15:07:32.000000000 +0530
+++ current/kernel/sched.c	2007-06-09 15:07:36.000000000 +0530
@@ -118,6 +118,7 @@
 
 /* CFS-related fields in a runqueue */
 struct lrq {
+	long nr_running;
 	unsigned long raw_weighted_load;
 	#define CPU_LOAD_IDX_MAX 5
 	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
@@ -125,6 +126,7 @@
 
 	u64 fair_clock, delta_fair_clock;
 	u64 exec_clock, delta_exec_clock;
+	u64 last_tick;  /* when did we last smoothen cpu load? */
 	s64 wait_runtime;
 	unsigned long wait_runtime_overruns, wait_runtime_underruns;
 
@@ -148,12 +150,18 @@
 	 * remote CPUs use both these fields when doing load calculation.
 	 */
 	long nr_running;
-	struct lrq lrq;
+	unsigned long raw_weighted_load;
+#ifdef CONFIG_SMP
+	#define CPU_LOAD_IDX_MAX 5
+	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
 
 	unsigned char idle_at_tick;
 #ifdef CONFIG_NO_HZ
 	unsigned char in_nohz_recently;
 #endif
+#endif
+	struct lrq lrq;
+
 	u64 nr_switches;
 
 	/*
@@ -589,13 +597,13 @@
 static inline void
 inc_raw_weighted_load(struct rq *rq, const struct task_struct *p)
 {
-	rq->lrq.raw_weighted_load += p->se.load_weight;
+	rq->raw_weighted_load += ...
From: Dmitry Adamushko
Date: Tuesday, June 12, 2007 - 2:03 am

[ briefly looked.. a few comments so far ]


(1)

I had an idea of per-sched-class 'load balance' calculator. So that
update_load() (as in your patch) would look smth like :

...
struct sched_class *class = sched_class_highest;
unsigned long total = 0;

do {
                total += class->update_load(..., now);
                class = class->next;
     } while (class);
...

and e.g. update_load_fair() would become a fair_sched_class :: update_load().

That said, all the sched_classes would report a load created by their
entities (tasks) over the last sampling period. Ideally, the
calculation should not be merely based on the 'raw_weighted_load' but
rather done in a similar way to update_load_fair() as in v17.

I'll take a look at how it can be mapped on the current v17 codebase
(including your patches #1-3) and come up with some real code so we
would have a base for discussion.



I think, it won't work properly this way. The first call returns a
load for last TICK_NSEC and all the consequent ones report zero load
('this_load = 0' internally).. as a result, we will get a lower load
than it likely was.

I guess, update_load_fair() (as it's in v17) could be slightly changed
to report the load for an interval of time over which the load
statistics have been accumulated (delta_exec_time and fair_exec_time):

update_load_fair(Irq, now - Irq->last_tick)

This new (second) argument would be used instead of TICK_NSEC
(internally in update_load_fair()) ... but again, I'll come up with

-- 
Best regards,
Dmitry Adamushko
-

From: Srivatsa Vaddagiri
Date: Tuesday, June 12, 2007 - 3:26 am

I like this idea. It neatly segregates load calculation across classes.
It effectively replaces what update_load() function I introduced in
Patch #4.


mm ..

        exec_delta64 = this_lrq->delta_exec_clock + 1;
        this_lrq->delta_exec_clock = 0;

So exec_delta64 (and fair_delta64) should be min 1 in successive calls. How can that lead to this_load = 0?

The idea behind 'replay lost ticks' is to avoid load smoothening of
-every- lrq -every- tick. Lets say that there are ten lrqs
(corresponding to ten different users). We load smoothen only the currently 
active lrq (whose task is currently running). Other lrqs load get smoothened 
as soon as they become active next time (thus catching up with all lost ticks).


-- 
Regards,
vatsa
-

From: Dmitry Adamushko
Date: Tuesday, June 12, 2007 - 5:23 am

Good.

(a minor disclaimer :)
We discussed it a bit with Ingo and I don't remember who first
expressed this idea in written words (although I seem to remember, I

Well, as a _temporary_ stub - just return the 'raw_weighted_load'
contributed by the RT tasks..
Ideally, we'd like a similar approach to the update_fair_load() --
i.e. we need the run-time history of rt_sched_class's (like of any
other class) tasks over the last sampling period, so e.g. we do
account periodic RT tasks which happen to escape accounting through
'raw_weghted_load' due to the fact that they are not active at the
moment of timer interrupts (when 'raw_weighted_load' snapshots are

just substitute {exec,fair}_delta == 1 in the following code:

        tmp64 = SCHED_LOAD_SCALE * exec_delta64;
        do_div(tmp64, fair_delta);
        tmp64 *= exec_delta64;
        do_div(tmp64, TICK_NSEC);
        this_load = (unsigned long)tmp64;

we'd get

        tmp64 = 1024 * 1;
        tmp64 =/ 1;
        tmp64 *= 1;
        tmp64 /= 1000000;


The raw idea behind update_load_fair() is that it evaluates the
run-time history between 2 consequent calls to it (which is now at
timer freq. --- that's a sapling period). So if you call
update_fair_load() in a loop, the sampling period is actually an
interval between 2 consequent calls. IOW, you can't say "3 ticks were
lost" so at first evaluate the load for the first tick, then the
second one, etc. ...

Anyway, I'm missing the details regarding the way you are going to do
per-group 'load balancing' so refrain from further commenting so
far... it's just that the current implementation of update_load_fair()

Ok, let's say user1 tasks were highly active till T1 moment of time..
cpu_load[] of user's lrq
has accumulated this load.
now user's tasks were not active for an interval of dT.. so you don't
update its cpu_load[] in the mean time? Let's say 'load balancing'
takes place at the moment T2 = T1 + dT

Are you going to do any 'load balancing' between users? ...
From: Srivatsa Vaddagiri
Date: Tuesday, June 12, 2007 - 6:30 am

Ok ..

But isn't that the same result we would have obtained anyways had we
called update_load_fair() on all lrq's on every timer tick? If a user's
lrq was inactive for several ticks, then its exec_delta will be seen as
zero for those several ticks, which means we would compute its 'this_load' to be
zero as well for those several ticks?

Basically what I want to know is, are we sacrificing any accuracy here
because of "deferring" smoothening of cpu_load for a (inactive) lrq
(apart from the inaccurate figure used during load_balance as you point

Assuming the lrq was inactive for all those 3 ticks and became active at
4th tick, would the end result of cpu_load (as obtained in my code) be

Even though this lost ticks loop is easily triggered with user-based lrqs, 
I think the same "loop" can be seen in current CFS code (i.e say v16)
when low level timer interrupt handler replays such lost timer ticks (say we
were in a critical section for some time with timer interrupt disabled).
As an example see arch/powerpc/kernel/time.c:timer_interrupt() calling
account_process_time->scheduler_tick in a loop.

If there is any bug in 'replay lost ticks' loop in the patch I posted, then
it should already be present in current (i.e v16) implementation of

Yes, patch #5 introduces group-aware load-balance. It is two-step:

First, we identify busiest group and busiest queue, based on
rq->raw_weighted_load/cpu_load (which is accumulation of weight from all 
clases on a CPU). This part of the code is untouched.

Next when loadbalancing between two chosen CPUs (busiest and this cpu),
move_tasks() is iteratively called on each user/group's lrq on both cpus, with
the max_load_move argument set to 1/2 the imabalnce between that user's lrqs
on both cpus. For this lrq imbalance calculation, I was using 
lrq->raw_weighted_load from both cpus, though I agree using

Good point. So how do we solve this? I really really want to avoid
running update_load_fair() on all lrq's every tick (it will be a ...
From: Dmitry Adamushko
Date: Tuesday, June 12, 2007 - 7:31 am

Yeah.. seems to be so. But let's consider whether these 'inactive ticks' are
really inactive [1] :

The fact that user's tasks are not active at the moment of a timer
interrupt doesn't mean
they were not active _during_ the last tick. That's why another
approach in update_load_fair() which doesn't depend on a snapshot of
rq->raw_weighted_load
at timer tick's time. I guess, we'd lose this with 'inactive ticks',
right? ok, maybe

At least, we are getting some inaccuracy (not in a generic case
though) due to the

        if (exec_delta64 > (u64)TICK_NSEC)
                exec_delta64 = (u64)TICK_NSEC;   [*]


With the current code, yes - it may be. In case, [*] condition (see
above) comes into play (and these 'inactive' ticks were not really


I'll take a look (e.g. I guess, we have got a notion of "user's
weght"... so does/how a user's weight contribute to his tasks weight..
otherwise, I think, the approach of determining
the busiest CPU based only on pure tasks' weight would be wrong.. will

yeahh.. have to think about it.
btw, I recall the patch #4 adds some light but noticeable overhead,

I guess, it's not only about CFS but about the users' behavior, which
is something
we can't control and so can't rely on it.
Say, a user was active till the moment T1 and then just gone.. - all
his tasks are really inactive.
So at the moment T2 user's Irq :: cpu_load will still express the
situation at the moment T1?
As long as user's lrq is not involved in 'load balancing', this
inaccuracy can be revealed only if the info is exported via /proc.

But say, user's task becomes finally active after _a lot_ of inactive
ticks (the user came back).. now it's in the rq and waiting for its
turn (which can be easily > 1 tick).. in the mean time 'load
balancing' is triggered.. and it considers the old lrq :: cpu_load[]
...


P.S.

just a personal impression.. I'm quite confused by this 'lrq' name...
it looks pretty similar to 'Irq' (with a big 'i') and I can't stop
reading it as ...
From: Srivatsa Vaddagiri
Date: Tuesday, June 12, 2007 - 8:43 am

Sorry this is not clear. We'd lose what with 'inactive' ticks?

If you are referring to the delta execution time a user's lrq consumed
in the middle of a tick and whether we would lose them during subsequent
update_load(), the answer IMO is no. Becasue put_prev_task_fair() would
account for the small delta execution time when the task/lrq got

I would say any lossy accounting would be bad in long run. If you think
put_prev_task_fair()->update_curr() still leaves open some problem, I

If that is a problem (and I tend to agree that it is), then it is not unique to 


Yes sure, we need to fix that assumption that exec_delta64 can't be


A user's weight controls fraction of CPU the user's tasks as a whole receive.

A task's weight controls fraction of CPU the task will receive -within-
the fraction alloted to that user.

Strictly speaking, a task's weight need not have to depend on its user's
weight. This is true if scheduler core recognizes both user and task levels of
scheduling in the hierarchy. If the scheduler were to recognize fewer
levels of hierarchy, then we will have to take into account a user's
weight in calculation task weight. See thread anchored at 

The load considered for determining busiest group/queue is the summation
of -all- task's load on a CPU. That's why I introduced update_load() in
Patch #4 which captures load from real-time tasks as well as
SCHED_NORMAL tasks. When you are changing that update_load() function
(based on class->update_load callback), it would be nice to keep this in
mind (that I need a weight field representing summation of all tasks

This probably comes from the split up raw_weighted_load/nr_running
fields. Although I don't know if the overhead is that noticeable in

I still think that this 'stale' cpu_load data won't last long enough to
serious affect load balance decisions. But something I agree definitely to 

I chose lrq to mean local run queue. Other names I can think of are
entity_rq or ...actually cfs_rq (as you ...
From: Srivatsa Vaddagiri
Date: Monday, June 11, 2007 - 8:56 am

This patch introduces the core changes in CFS required to accomplish
group fairness at higher levels. It also modifies load balance interface
between classes a bit, so that move_tasks (which is centric to load
balance) can be reused to balance between runqueues of various types
(struct rq in case of SCHED_RT tasks, struct lrq in case of
SCHED_NORMAL/BATCH tasks).

Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
 include/linux/sched.h |   14 ++
 kernel/sched.c        |  155 ++++++++++++++++--------------
 kernel/sched_fair.c   |  252 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched_rt.c     |   49 ++++++++-
 4 files changed, 367 insertions(+), 103 deletions(-)

Index: current/include/linux/sched.h
===================================================================
--- current.orig/include/linux/sched.h	2007-06-09 15:04:54.000000000 +0530
+++ current/include/linux/sched.h	2007-06-09 15:07:37.000000000 +0530
@@ -866,8 +866,13 @@
 	struct task_struct * (*pick_next_task) (struct rq *rq, u64 now);
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p, u64 now);
 
-	struct task_struct * (*load_balance_start) (struct rq *rq);
-	struct task_struct * (*load_balance_next) (struct rq *rq);
+#ifdef CONFIG_SMP
+	int (*load_balance) (struct rq *this_rq, int this_cpu,
+			struct rq *busiest,
+		 	unsigned long max_nr_move, unsigned long max_load_move,
+			struct sched_domain *sd, enum idle_type idle,
+			int *all_pinned, unsigned long *total_load_moved);
+#endif
 	void (*task_tick) (struct rq *rq, struct task_struct *p);
 	void (*task_new) (struct rq *rq, struct task_struct *p);
 };
@@ -893,6 +898,11 @@
 	s64 fair_key;
 	s64 sum_wait_runtime, sum_sleep_runtime;
 	unsigned long wait_runtime_overruns, wait_runtime_underruns;
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	struct sched_entity *parent;
+	struct lrq *lrq,   /* runqueue on which this entity is (to be) queued */
+		   *my_q;  /* runqueue "owned" by this entity/group */
+#endif
 };
 
 struct ...
From: Dmitry Adamushko
Date: Wednesday, June 13, 2007 - 1:56 pm

IMHO, it looks a bit frightening :) maybe it would be possible to
create a structure that combines some relevant argumens .. at least,

I think, there is a possible problem here. If I'm not complete wrong,
this function (move_tasks() in the current mainline) can move more
'load' than specified by the 'max_load_move'..



'(long)max_load_move > 0'      ?



'(long)rem_load_move <= 0'


-- 
Best regards,
Dmitry Adamushko
-

From: Srivatsa Vaddagiri
Date: Thursday, June 14, 2007 - 5:06 am

I agree :) It is taking (ooops) 15 args (8 perhaps was the previous record

How does this look?

struct balance_tasks_args {
	struct rq *this_rq, struct rq *busiest;
	unsigned long max_nr_move, unsigned long max_load_move;
	struct sched_domain *sd, enum idle_type idle;
	int this_best_prio, best_prio, best_prio_seen;
	int *all_pinned;
	unsigned long *load_moved;
	void *iterator_arg;
	struct task_struct *(*iterator_start)(void *arg);
	struct task_struct *(*iterator_next)(void *arg));
};

static int balance_tasks(struct balance_tasks_args *arg);

[ down to one argument now! ]

?


Yes I think you are right. I will tackle this in next iteration.

Thanks for all your review so far!

-- 
Regards,
vatsa
-

From: Srivatsa Vaddagiri
Date: Monday, June 11, 2007 - 8:58 am

This patch hooks up cpu scheduler with Paul Menage's container
infrastructure.

The container patches allows administrator to create arbitrary groups of tasks
and define resource allocation for each group. By registering with container
infrastructure, cpu scheduler is made aware of group membership information for
each task, creation/deletion of groups etc and can use that information to
provide fairness between groups.

This mechanism can indirectly be used to provide fairness between users
also. All that is needed is a user-space program (which I am working on
and will post later) which monitors for PROC_EVENT_UID events (using
process event connector) and moves the task to appropriate user-directory in
container filesystem.

As an example for "HOWTO use this feature", follow these steps:

        1. Define CONFIG_FAIR_GROUP_SCHED (General Setup->Fair Group Scheduler)
           and compile the kernel
        2. After booting:

        # cd /dev
        # mkdir cpuctl
        # mount -t container -ocpuctl none /dev/cpuctl
        # cd cpuctl
        # mkdir grpA
        # mkdir grpB

        # echo some_pid1 > grpA/tasks
        # echo some_pid2 > grpA/tasks
        # echo some_pid3 > grpA/tasks
        # echo some_pid4 > grpA/tasks

        ...

        # echo another_pidX > grpB/tasks
        # echo another_pidY > grpB/tasks

All tasks in grpA/tasks should cumulatively share same CPU as all tasks
in grpB/tasks.

Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>


---
 include/linux/container_subsys.h |    6 +
 include/linux/sched.h            |    1 
 init/Kconfig                     |    8 +
 kernel/sched.c                   |  234 ++++++++++++++++++++++++++++++++++++++-
 kernel/sched_fair.c              |   36 ++++--
 5 files changed, 274 insertions(+), 11 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c	2007-06-09 15:07:37.000000000 +0530
+++ ...
From: Srivatsa Vaddagiri
Date: Monday, June 11, 2007 - 9:02 am

[snip]

	+ Fix CFS debug interface to be group aware

-- 
Regards,
vatsa
-

From: Ingo Molnar
Date: Monday, June 11, 2007 - 12:37 pm

i currently have these 3 patches applied to the CFS queue and it's 
looking pretty good so far! If it continues to be problem-free i'll 
release them as part of -v17, just to check that they truly have no bad 
side-effects (they shouldnt). Then #4 can go into -v18.

i've attached my current -v17 tree - it should apply mostly cleanly 
ontop of the -mm queue (with a minor number of fixups). Could you 
refactor the remaining 3 patches ontop of this base? There's some 

btw., the plan here is to turn off 'bit 0' in sched_features: i.e. to 
use the precise statistics to calculate lrq->cpu_load[], not the 
timer-irq-sampled imprecise statistics. Dmitry has fixed a couple of 
bugs in it that made it not work too well in previous CFS versions, but 
now we are ready to turn it on for -v17. (indeed in my tree it's already 

ok. Kirill, how do you like Srivatsa's current approach? Would be nice 

you'll get the best hackbench results by using SCHED_BATCH:

   chrt -b 0 ./hackbench 10

or indeed increasing the runtime_limit would work too.

	Ingo
From: Ingo Molnar
Date: Monday, June 11, 2007 - 12:39 pm

i mean bit 6, value 64. I flipped around its meaning in -v17-rc4, so the 
new precise stats code there is now default-enabled - making SMP 
load-balancing more accurate.

	Ingo
-

From: Srivatsa Vaddagiri
Date: Monday, June 11, 2007 - 10:50 pm

ok. I am also most concerned about not upsetting current performance of
CFS when CONFIG_FAIR_GROUP_SCHED is turned off. Staging these patches in



I must be missing something here. AFAICS, cpu_load calculation still is
timer-interrupt driven in the -v17 snapshot you sent me. Besides, there
is no change in default value of bit 6 b/n v16 and v17:

-unsigned int sysctl_sched_features __read_mostly = 1 | 2 | 4 | 8 | 0 | 0;
+unsigned int sysctl_sched_features __read_mostly = 0 | 2 | 4 | 8 | 0 | 0;

So where's this precise stats based calculation of cpu_load?

Anyway, do you agree that splitting the cpu_load/nr_running fields so that:

rq->nr_running 	   	  = total count of -all- tasks in runqueue
rq->raw_weighted_load	  = total weight of -all- tasks in runqueue
rq->lrq.nr_running 	  = total count of SCHED_NORMAL/BATCH tasks in runqueue
rq->lrq.raw_weighted_load = total weight of SCHED_NORMAL/BATCH tasks in runqueue

is a good thing to avoid SCHED_RT<->SCHED_NORMAL/BATCH mixup (as accomplished 
in Patch #4)?

If you don't agree, then I will make this split dependent on

Just to be clear, by container patches, I am referring to "process" container
patches from Paul Menage [1]. They aren't necessarily tied to
"virtualization-related" container support in -mm tree, although I
believe that "virtualization-related" container patches will make use of the 
same "process-related" container patches for their task-grouping requirements. 

One thing the current patches don't support is the notion of virtual
cpus (which Kirill and other "virtualization-related" container folks would 
perhaps want). IMHO, the current patches can still be usefull for
containers to load balance between those virtual cpus (as and when it is 



References:

1.  https://lists.linux-foundation.org/pipermail/containers/2007-May/005261.html

-- 
Regards,
vatsa
-

From: Ingo Molnar
Date: Monday, June 11, 2007 - 11:26 pm

but there's a change in the interpretation of bit 6:

-       if (!(sysctl_sched_features & 64)) {
-               this_load = this_rq->raw_weighted_load;
+       if (sysctl_sched_features & 64) {
+               this_load = this_rq->lrq.raw_weighted_load;

the update of the cpu_load[] value is timer interrupt driven, but the 
_value_ that is sampled is not. Previously we used ->raw_weighted_load 
(at whatever value it happened to be at the moment the timer irq hit the 
system), now we basically use a load derived from the fair-time passed 
since the last scheduler tick. (Mathematically it's close to an integral 
of load done over that period) So it takes all scheduling activities and 
all load values into account to calculate the average, not just the 
value that was sampled by the scheduler tick.

this, besides being more precise (it for example correctly samples 
short-lived, timer-interrupt-driven workloads too, which were largely 
'invisible' to the previous load calculation method), also enables us to 
make the scheduler tick hrtimer based in the (near) future. (in essence 

yes, i agree in general, even though this causes some small overhead. 
This also has another advantage: the inter-policy isolation and load 
balancing is similar to what fair group scheduling does, so even 'plain' 


i'd still like to hear back from Kirill & co whether this framework is 
flexible enough for their work (OpenVZ, etc.) too.

	Ingo
-

From: Kirill Korotaev
Date: Friday, June 15, 2007 - 5:46 am

My IMHO is that so far the proposed group scheduler doesn't look ready/suitable.
We need to have a working SMP version before it will be clear
whether the whole approach is good and works correct on variety of load patterns.

Thanks,
Kirill

-

From: Srivatsa Vaddagiri
Date: Friday, June 15, 2007 - 7:06 am

Hi Kirill,
	Yes its work-in-progress and hence is not ready/fully-functional
(yet). The patches I posted last gives an idea of the direction it is
heading. For ex: http://lkml.org/lkml/2007/6/11/162 and
http://lkml.org/lkml/2007/5/25/146 gives an idea of how SMP load balance will 
works.

IMHO the nice thing about this approach is it (re)uses lot of code in
scheduler which is there already to achieve fairness between tasks and
higher schedulable elements (users/containers etc).

Also with CFS engine's precise nanosecond accurate accounting and time-sorted 
list of tasks/entities, I feel we will get much tighter control over 

If you have any headsup thoughts on areas/workloads where this may pose problems
for container/user scheduling, I would be glad to hear them. Otherwise I would 
greatly wellcome any help in developing/reviewing these patches which meets 
both our goals!

-- 
Regards,
vatsa
-

Previous thread: [PATCH] Fix broken ifdefs in usbtouchscreen by Ondrej Zary on Monday, June 11, 2007 - 8:09 am. (1 message)

Next thread: Re: [PATCH] move the kernel to 16MB for NUMA-Q by Andrew Morton on Monday, June 11, 2007 - 9:15 am. (15 messages)