Hi,
perhaps someone can give me a hint what I should consider to look for in
order to change the ("old" 2.6.21) scheduler such that it schedules the
highest priority task of a given runqueue.
Given a multiprocessor system I currently observe that whenever there
are two tasks on one CPU, the lower priority one is migrated to another
CPU. But I don't realize why this happens. From looking at the source
code I thought it should be the highest priority one (lowest bit set in
the runqueue's bitmap) according to
idx = sched_find_first_bit(array->bitmap);
within move_tasks(). The idx value is then used as an index (surprise)
to the linked list of tasks of this particular priority and one task is
picked:
head = array->queue + idx;
curr = head->prev;
tmp = list_entry(curr, struct task_struct, run_list);
Is my assumption wrong? Using printk()s within this code section makes
the system just hang completely quite soon. The schedstats do not notify
me immediately. So I am a bit lost on how to track down or trace the
problem.
Can anybody confirm that my observations are correct that the scheduler
picks the lowest priority job of a runqueue for migration?
What needs to be changed in order to pick the highest priority one?
Martin
-
in the SMP migration code, the 'old scheduler' indeed picks the lowest
priority one, _except_ if that task is running on another CPU or is too
'cache hot':
if (skip_for_load ||
!can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) {
also, from the priority-queue at 'idx', we pick head->prev, i.e. we
process the list in the opposite order as schedule(). (This got changed
in CFS to process in the same direction - which is more logical and also
yield the most cache-cold tasks for migration.)
yep, printk locks up. You can use my static tracer:
http://people.redhat.com/mingo/latency-tracing-patches/
add explicit static tracepoints to the scheduler code you want to
instrument via trace_special(x,y,z) calls [with parameters that interest
you most], and you can read out the trace via:
http://people.redhat.com/mingo/latency-tracing-patches/trace-it.c
Ingo
-
But why is it, that the scheduler picks the lowest priority one? I thought sched_find_first_bit() picks the index of the lowest order bit in the bitmap and thus the highest priority job. Is that wrong? What needs to be changed to let the scheduler pick the highest priority task from a given runqueue? I am very confused ... Martin -
it first picks the lowest index (i.e. the highest priority active priority-queue), but within those tasks (each task in that priority queue has equal priority) the load-balancer has freedom to pick any. Based on performance data we went for picking from the tail of the queue. Ingo -
That's fine with me, that within the same priority-queue any task can be chosen. But assume two tasks with highly different priorities, such as 105 and 135 are scheduled on the same processor and one of them is now to be migrated -- shouldn't be the queue with task P=105 considered first for migration by this code? Both tasks would use different queues with their own linked lists, right? Martin -
yes. What makes you believe that the lower priority one (prio 135) is chosen? [ as i said before, that will only be chosen if all tasks in the higher-priority queue (prio 105) are either already running on a CPU or have recently run so that the cache-hot logic skips them. ] Ingo -
This believe is primarily based on my observations of multiple benchmark runs and also on your statement earlier: »in the SMP migration code, the 'old scheduler' indeed picks the lowest priority one«. Perhaps it is just an unfortunate coincidence that at ~90% of the time a migration decision is made, the higher priority process is currently cache hot whereas the lower priority process is not. That would be unlucky for me as I would like to decide upon specific runtime circumstances whether the highest or the lowest priority job of a runqueue should be migrated to another CPU. :-/ Martin -
oh, sorry, that was meant to be the 'highest priority one' :-/ so i think you got it all right, i just typoed that first sentence. Ingo -
Okay, now I think I understood this part of the code correctly. The reason why I observe a continous migration of the _lower_ priority tasks is most probably due to the fact that the higher priority one is currently running, according to: can_migrate_task() in move_tasks(), and therein: if (task_running(rq, p)) return 0; I tracked down via an extended /proc/schedstats that my tasks fall frequently into this pitfall. I basically solved it by making use of the more active push-strategy which is called later by load_balance() once the move_tasks() function did not succeed. So in case I need the higher priority tasks, I return immediately from move_tasks(). Thanks for your help, Martin -
