Re: [git] CFS-devel, group scheduler, fixes

To: Ingo Molnar <mingo@...>
Cc: dimm <dmitry.adamushko@...>, <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>, Peter Zijlstra <a.p.zijlstra@...>, Mike Galbraith <efault@...>
Date: Wednesday, September 19, 2007 - 2:03 am

This patch attempts to improve CFS's SMP global fairness based on the new 
virtual time design.

Removed vruntime adjustment in set_task_cpu() as it skews global fairness.

Modified small_imbalance logic in find_busiest_group(). If there's a small
imbalance, move tasks from the busiest to the local sched_group only if the
local group contains a CPU whose min_vruntime is the maximum among all CPUs
in the same sched_domain. This prevents any CPU from advancing too far ahead
in virtual time and avoids tasks thrashing between two CPUs without
utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs,
since the load is not evenly divisible by the number of CPUs, we want the
extra load to be spread fairly across every CPU in the system.
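
The small-imbalance rule described above can be sketched outside the kernel
roughly as follows. This is a simplified model, not the patch itself: the
toy_cpu and toy_group types and both function names are hypothetical
stand-ins for the rq and sched_group structures. Like the patch, it uses a
plain ">" comparison starting from 0, so it ignores u64 wraparound and
selects no group at all when every min_vruntime is 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a runqueue and a sched_group. */
struct toy_cpu {
	uint64_t min_vruntime;
};

struct toy_group {
	const struct toy_cpu *cpus;
	size_t nr_cpus;
};

/*
 * Return the index of the group that contains the CPU with the largest
 * min_vruntime in the domain, or -1 if no CPU has a nonzero min_vruntime.
 */
static int find_max_vruntime_group(const struct toy_group *groups,
				   size_t nr_groups)
{
	uint64_t max_vruntime = 0;
	int best = -1;

	for (size_t g = 0; g < nr_groups; g++) {
		for (size_t i = 0; i < groups[g].nr_cpus; i++) {
			if (groups[g].cpus[i].min_vruntime > max_vruntime) {
				max_vruntime = groups[g].cpus[i].min_vruntime;
				best = (int)g;
			}
		}
	}
	return best;
}

/*
 * Small-imbalance decision: pull one task's worth of load only when the
 * local group is the one furthest ahead in virtual time.
 */
static unsigned long small_imbalance(int this_group, int max_group,
				     unsigned long busiest_load_per_task)
{
	assert(this_group >= 0);
	return this_group == max_group ? busiest_load_per_task : 0;
}
```

In this model only one group per domain is allowed to pull on a small
imbalance at any given time, and which group that is changes as
min_vruntime advances on the other CPUs.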

Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task
runs a trivial while (1) loop. The benchmark runs for 300 seconds and,
every T seconds, it samples for each task the following:

1. Actual CPU time the task received during the past T seconds.

2. Ideal CPU time it would receive under a perfect fair scheduler.

3. Lag = ideal time - actual time, where a positive lag means the task 
received less CPU time than its fair share and negative means it received 
more.

4. Error = lag / ideal time
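
To make the four quantities above concrete, here is a minimal sketch in C.
It assumes equal-weight tasks and at least as many tasks as CPUs, so that
each task's ideal share over an interval is T * ncpus / ntasks. The function
names are illustrative and are not taken from the actual benchmark.

```c
#include <assert.h>

/*
 * Ideal CPU time (in seconds) for one task over an interval of
 * `interval` seconds, assuming equal-weight tasks and ntasks >= ncpus.
 */
static double ideal_time(double interval, int ncpus, int ntasks)
{
	assert(ncpus > 0 && ntasks >= ncpus);
	return interval * ncpus / ntasks;
}

/* Lag: positive when the task got less than its fair share. */
static double lag(double ideal, double actual)
{
	return ideal - actual;
}

/* Error as a percentage of the ideal time. */
static double error_pct(double ideal, double actual)
{
	assert(ideal > 0.0);
	return 100.0 * lag(ideal, actual) / ideal;
}
```

With T = 30, 8 CPUs and 10 tasks, ideal_time() gives 24 seconds. Under these
assumptions a task that never ran during the interval has an error of 100%,
and a task that had a CPU to itself for the whole interval has an error of
(24 - 30) / 24 = -25%.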

The following shows the max and min errors among all samples for all tasks 
before and after applying the patch:

Before:

Sampling interval: 30 s
Max error: 100.00%
Min error: -25.00%

Sampling interval: 10 s
Max error: 27.62%
Min error: -25.00%

After:

Sampling interval: 30 s
Max error: 1.33%
Min error: -1.29%

Sampling interval: 10 s
Max error: 7.38%
Min error: -6.25%

The errors for the 10 s sampling interval are still not as small as I had
hoped, but the patch does look like an improvement.

    tong

Signed-off-by: Tong Li <tong.n.li@intel.com>
---
--- linux-2.6-sched-devel-orig/kernel/sched.c	2007-09-15 22:00:48.000000000 -0700
+++ linux-2.6-sched-devel/kernel/sched.c	2007-09-18 22:10:52.000000000 -0700
@@ -1033,9 +1033,6 @@ void set_task_cpu(struct task_struct *p,
  	if (p->se.block_start)
  		p->se.block_start -= clock_offset;
  #endif
-	if (likely(new_rq->cfs.min_vruntime))
-		p->se.vruntime -= old_rq->cfs.min_vruntime -
-						new_rq->cfs.min_vruntime;

  	__set_task_cpu(p, new_cpu);
  }
@@ -1599,6 +1596,7 @@ static void __sched_fork(struct task_str
  	p->se.exec_start		= 0;
  	p->se.sum_exec_runtime		= 0;
  	p->se.prev_sum_exec_runtime	= 0;
+	p->se.vruntime			= 0;

  #ifdef CONFIG_SCHEDSTATS
  	p->se.wait_start		= 0;
@@ -2277,6 +2275,8 @@ find_busiest_group(struct sched_domain *
  		   int *sd_idle, cpumask_t *cpus, int *balance)
  {
  	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
+	struct sched_group *max_vruntime_group = NULL;
+	u64 max_vruntime = 0;
  	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
  	unsigned long max_pull;
  	unsigned long busiest_load_per_task, busiest_nr_running;
@@ -2322,6 +2322,11 @@ find_busiest_group(struct sched_domain *

  			rq = cpu_rq(i);

+			if (rq->cfs.min_vruntime > max_vruntime) {
+				max_vruntime = rq->cfs.min_vruntime;
+				max_vruntime_group = group;
+			}
+
  			if (*sd_idle && rq->nr_running)
  				*sd_idle = 0;

@@ -2483,59 +2488,16 @@ group_next:
  	 * moved
  	 */
  	if (*imbalance < busiest_load_per_task) {
-		unsigned long tmp, pwr_now, pwr_move;
-		unsigned int imbn;
-
  small_imbalance:
-		pwr_move = pwr_now = 0;
-		imbn = 2;
-		if (this_nr_running) {
-			this_load_per_task /= this_nr_running;
-			if (busiest_load_per_task > this_load_per_task)
-				imbn = 1;
-		} else
-			this_load_per_task = SCHED_LOAD_SCALE;
-
-		if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
-					busiest_load_per_task * imbn) {
-			*imbalance = busiest_load_per_task;
-			return busiest;
-		}
-
-		/*
-		 * OK, we don't have enough imbalance to justify moving tasks,
-		 * however we may be able to increase total CPU power used by
-		 * moving them.
+		/* 
+		 * When there's small imbalance, move tasks only if this
+		 * sched_group contains a CPU whose min_vruntime is the 
+		 * maximum among all CPUs in the same domain.
  		 */
-
-		pwr_now += busiest->__cpu_power *
-				min(busiest_load_per_task, max_load);
-		pwr_now += this->__cpu_power *
-				min(this_load_per_task, this_load);
-		pwr_now /= SCHED_LOAD_SCALE;
-
-		/* Amount of load we'd subtract */
-		tmp = sg_div_cpu_power(busiest,
-				busiest_load_per_task * SCHED_LOAD_SCALE);
-		if (max_load > tmp)
-			pwr_move += busiest->__cpu_power *
-				min(busiest_load_per_task, max_load - tmp);
-
-		/* Amount of load we'd add */
-		if (max_load * busiest->__cpu_power <
-				busiest_load_per_task * SCHED_LOAD_SCALE)
-			tmp = sg_div_cpu_power(this,
-					max_load * busiest->__cpu_power);
-		else
-			tmp = sg_div_cpu_power(this,
-				busiest_load_per_task * SCHED_LOAD_SCALE);
-		pwr_move += this->__cpu_power *
-				min(this_load_per_task, this_load + tmp);
-		pwr_move /= SCHED_LOAD_SCALE;
-
-		/* Move if we gain throughput */
-		if (pwr_move > pwr_now)
+		if (max_vruntime_group == this)
  			*imbalance = busiest_load_per_task;
+		else
+			*imbalance = 0;
  	}

  	return busiest;