Re: Scheduling the highest priority task

Previous thread: 2.6.22.1: hang with forcedeth driver? by Timo Jantunen on Thursday, August 2, 2007 - 1:57 am. (3 messages)

Next thread: [RFC][PATCH] Removal of FASTROUTE definition include/linux/if_packet.h by Rami Rosen on Thursday, August 2, 2007 - 2:23 am. (2 messages)
From: Martin Roehricht
Date: Thursday, August 2, 2007 - 1:58 am

Hi,

perhaps someone can give me a hint what I should consider to look for in 
order to change the ("old" 2.6.21) scheduler such that it schedules the 
highest priority task of a given runqueue.
Given a multiprocessor system I currently observe that whenever there 
are two tasks on one CPU, the lower priority one is migrated to another 
CPU. But I don't realize why this happens. From looking at the source 
code I thought it should be the highest priority one (lowest bit set in 
the runqueue's bitmap) according to
	idx = sched_find_first_bit(array->bitmap);
within move_tasks(). The idx value is then used as an index (surprise) 
to the linked list of tasks of this particular priority and one task is 
picked:
	head = array->queue + idx;
         curr = head->prev;
         tmp = list_entry(curr, struct task_struct, run_list);

Is my assumption wrong? Using printk()s within this code section makes 
the system just hang completely quite soon. The schedstats do not notify 
me immediately. So I am a bit lost on how to track down or trace the 
problem.

Can anybody confirm that my observations are correct that the scheduler 
picks the lowest priority job of a runqueue for migration?
What needs to be changed in order to pick the highest priority one?

Martin
-

From: Ingo Molnar
Date: Thursday, August 2, 2007 - 4:40 am

in the SMP migration code, the 'old scheduler' indeed picks the lowest 
priority one, _except_ if that task is running on another CPU or is too 
'cache hot':

        if (skip_for_load ||
            !can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) {

also, from the priority-queue at 'idx', we pick head->prev, i.e. we 
process the list in the opposite order as schedule(). (This got changed 
in CFS to process in the same direction - which is more logical and also 
yield the most cache-cold tasks for migration.)


yep, printk locks up. You can use my static tracer:

   http://people.redhat.com/mingo/latency-tracing-patches/

add explicit static tracepoints to the scheduler code you want to 
instrument via trace_special(x,y,z) calls [with parameters that interest 
you most], and you can read out the trace via:

   http://people.redhat.com/mingo/latency-tracing-patches/trace-it.c

	Ingo
-

From: Martin Roehricht
Date: Thursday, August 2, 2007 - 8:00 am

But why is it, that the scheduler picks the lowest priority one? I 
thought sched_find_first_bit() picks the index of the lowest order bit 
in the bitmap and thus the highest priority job. Is that wrong?
What needs to be changed to let the scheduler pick the highest priority 
task from a given runqueue?
I am very confused ...

Martin
-

From: Ingo Molnar
Date: Thursday, August 2, 2007 - 8:03 am

it first picks the lowest index (i.e. the highest priority active 
priority-queue), but within those tasks (each task in that priority 
queue has equal priority) the load-balancer has freedom to pick any. 
Based on performance data we went for picking from the tail of the 
queue.

	Ingo
-

From: Martin Roehricht
Date: Thursday, August 2, 2007 - 8:14 am

That's fine with me, that within the same priority-queue any task can be 
chosen. But assume two tasks with highly different priorities, such as 
105 and 135 are scheduled on the same processor and one of them is now 
to be migrated -- shouldn't be the queue with task P=105 considered 
first for migration by this code?
Both tasks would use different queues with their own linked lists, right?

Martin
-

From: Ingo Molnar
Date: Thursday, August 2, 2007 - 8:19 am

yes. What makes you believe that the lower priority one (prio 135) is 
chosen? [ as i said before, that will only be chosen if all tasks in the 
higher-priority queue (prio 105) are either already running on a CPU or 
have recently run so that the cache-hot logic skips them. ]

	Ingo
-

From: Martin Roehricht
Date: Thursday, August 2, 2007 - 8:46 am

This believe is primarily based on my observations of multiple benchmark 
runs and also on your statement earlier: »in the SMP migration code, the 
'old scheduler' indeed picks the lowest priority one«.

Perhaps it is just an unfortunate coincidence that at ~90% of the time a 
migration decision is made, the higher priority process is currently 
cache hot whereas the lower priority process is not. That would be 
unlucky for me as I would like to decide upon specific runtime 
circumstances whether the highest or the lowest priority job of a 
runqueue should be migrated to another CPU. :-/

Martin
-

From: Ingo Molnar
Date: Thursday, August 2, 2007 - 12:48 pm

oh, sorry, that was meant to be the 'highest priority one' :-/

so i think you got it all right, i just typoed that first sentence.

	Ingo
-

From: Martin Roehricht
Date: Thursday, August 2, 2007 - 2:05 pm

Okay, now I think I understood this part of the code correctly. The 
reason why I observe a continous migration of the _lower_ priority tasks 
is most probably due to the fact that the higher priority one is 
currently running, according to:
	can_migrate_task() in move_tasks(), and therein:

	if (task_running(rq, p))
		return 0;

I tracked down via an extended /proc/schedstats that my tasks fall 
frequently into this pitfall. I basically solved it by making use of the 
more active push-strategy which is called later by load_balance() once 
the move_tasks() function did not succeed. So in case I need the higher 
priority tasks, I return immediately from move_tasks().

Thanks for your help,
Martin
-

Previous thread: 2.6.22.1: hang with forcedeth driver? by Timo Jantunen on Thursday, August 2, 2007 - 1:57 am. (3 messages)

Next thread: [RFC][PATCH] Removal of FASTROUTE definition include/linux/if_packet.h by Rami Rosen on Thursday, August 2, 2007 - 2:23 am. (2 messages)