Ingo Molnar, author of the O(1) scheduler [earlier story] and the original preemptive kernel patch, has provided a patch to make the O(1) scheduler fully aware of HyperThreading. Ingo explains:
"Symmetric multithreading (hyperthreading) is an interesting new concept that IMO deserves full scheduler support. Physical CPUs can have multiple (typically 2) logical CPUs embedded, and can run multiple tasks 'in parallel' by utilizing fast hardware-based context-switching between the two register sets upon things like cache-misses or special instructions. To the OSs the logical CPUs are almost undistinguishable from physical CPUs. In fact the current scheduler treats each logical CPU as a separate physical CPU - which works but does not maximize multiprocessing performance on SMT/HT boxes."
Read on for Ingo's full explanation.
From: Ingo Molnar
Subject: [patch] "fully HT-aware scheduler" support, 2.5.31-BK-curr
Date: Tue, 27 Aug 2002 03:44:23 +0200 (CEST)
symmetric multithreading (hyperthreading) is an interesting new concept
that IMO deserves full scheduler support. Physical CPUs can have multiple
(typically 2) logical CPUs embedded, and can run multiple tasks 'in
parallel' by utilizing fast hardware-based context-switching between the
two register sets upon things like cache-misses or special instructions.
To the OSs the logical CPUs are almost undistinguishable from physical
CPUs. In fact the current scheduler treats each logical CPU as a separate
physical CPU - which works but does not maximize multiprocessing
performance on SMT/HT boxes.
The following properties have to be provided by a scheduler that wants to
be 'fully HT-aware':
- HT-aware passive load-balancing: the irq-driven balancing has to be
per-physical-CPU, not per-logical-CPU.
Otherwise it might happen that one physical CPU runs 2 tasks, while
another physical CPU runs no threads. The stock scheduler does not
recognize this condition as 'imbalance' - to the scheduler it appears
as if the first two CPUs had 1-1 task running, the second two CPUs had
0-0 tasks running. The stock scheduler does not realize that the two
logical CPUs belong to the same physical CPU.
- 'active' load-balancing when a logical CPU goes idle and thus causes a
physical CPU imbalance.
This is a mechanism that simply does not exist in the stock 1:1
scheduler - the imbalance caused by an idle CPU can be solved via the
normal load-balancer. In the HT case the situation is special because
the source physical CPU might have just two tasks running, both
runnable - this is a situation that the stock load-balancer is unable
to handle - running tasks are hard to migrate away. But it's
essential to do this - otherwise a physical CPU can get stuck running 2
tasks, while another physical CPU stays idle.
- HT-aware task pickup.
When the scheduler picks a new task, it should prefer all tasks that
share the same physical CPU - before trying to pull in tasks from other
CPUs. The stock scheduler only picked tasks that were scheduled to that
particular logical CPU.
- HT-aware affinity.
Tasks should attempt to 'stick' to physical CPUs, not logical CPUs.
- HT-aware wakeup.
again this is something completely new - the stock scheduler only knows
about the 'current' CPU, it does not know about any sibling [== logical
CPUs on the same physical CPU] logical CPUs. On HT, if a thread is
woken up on a logical CPU that is already executing a task, and if a
sibling CPU is idle, then the sibling CPU has to be woken up and has to
execute the newly woken up task immediately.
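The imbalance described in the first item above can be illustrated with a small sketch. This is not code from the patch: the fixed 2-package, 2-sibling layout, the helper names (phys_id, physical_load, logical_imbalance, physical_imbalance) and the "differs by two or more tasks" rule are all assumptions made for illustration, and the real load-balancer uses more elaborate heuristics.

```c
#include <assert.h>

#define NR_LOGICAL  4   /* assumption: 2 physical CPUs x 2 logical siblings */
#define NR_PHYSICAL 2

/* Hypothetical mapping: logical CPUs 0,1 share package 0; 2,3 share package 1. */
int phys_id(int cpu)
{
    assert(cpu >= 0 && cpu < NR_LOGICAL);
    return cpu / 2;
}

/* Simplified rule (an assumption): imbalanced if max and min differ by >= 2. */
static int spread_at_least_two(const int *v, int n)
{
    int min = v[0], max = v[0];

    for (int i = 1; i < n; i++) {
        if (v[i] < min)
            min = v[i];
        if (v[i] > max)
            max = v[i];
    }
    return max - min >= 2;
}

/* Sum the running tasks of all siblings of each physical CPU. */
void physical_load(const int nr_running[NR_LOGICAL], int load[NR_PHYSICAL])
{
    for (int p = 0; p < NR_PHYSICAL; p++)
        load[p] = 0;
    for (int cpu = 0; cpu < NR_LOGICAL; cpu++)
        load[phys_id(cpu)] += nr_running[cpu];
}

/* Per-logical-CPU view: every logical CPU is treated as a separate CPU. */
int logical_imbalance(const int nr_running[NR_LOGICAL])
{
    return spread_at_least_two(nr_running, NR_LOGICAL);
}

/* Per-physical-CPU view: sibling loads are added up before comparing. */
int physical_imbalance(const int nr_running[NR_LOGICAL])
{
    int load[NR_PHYSICAL];

    physical_load(nr_running, load);
    return spread_at_least_two(load, NR_PHYSICAL);
}
```

With per-logical-CPU task counts of 1-1 and 0-0, the per-logical view sees loads that differ by at most one task and reports no imbalance, while the per-physical view sees 2 tasks against 0 and does report one.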
the attached patch (against 2.5.31-BK-curr) implements all the above
HT-scheduling needs by introducing the concept of a shared runqueue:
multiple CPUs can share the same runqueue. A shared, per-physical-CPU
runqueue magically fulfills all the above HT-scheduling needs. Obviously
this complicates scheduling and load-balancing somewhat (see the patch for
details), so great care has been taken to not impact the non-HT schedulers
(SMP, UP). In fact the SMP scheduler is a compile-time special case of the
HT scheduler. (and the UP scheduler is a compile-time special case of the
SMP scheduler.)
the patch is based on Jun Nakajima's prototyping work - the lowlevel
x86/Intel bits are still those from Jun, the sched.c bits are newly
implemented and generalized.
There's a single flexible interface for lowlevel boot code to set up
physical CPUs: sched_map_runqueue(cpu1, cpu2) maps cpu2 into cpu1's
runqueue. The patch also implements the lowlevel bits for P4 HT boxes for
the 2/package case.
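To make the shared-runqueue idea concrete, here is a minimal sketch of what mapping one logical CPU into another's runqueue could look like. Only the name sched_map_runqueue(cpu1, cpu2) and its meaning come from the description above; the struct layout, the cpu_rq_map table, and the sched_init_sketch() and cpu_rq() helpers are hypothetical stand-ins, and locking is left out entirely.

```c
#include <assert.h>

#define NR_CPUS 4   /* assumption: 4 logical CPUs */

/* Hypothetical, heavily reduced runqueue. */
struct runqueue {
    int nr_running;   /* tasks queued on this runqueue */
    int nr_cpus;      /* logical CPUs currently sharing this runqueue */
};

static struct runqueue runqueues[NR_CPUS];
static struct runqueue *cpu_rq_map[NR_CPUS];

/* Default state: every logical CPU owns a private runqueue (the SMP case). */
void sched_init_sketch(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        runqueues[cpu].nr_running = 0;
        runqueues[cpu].nr_cpus = 1;
        cpu_rq_map[cpu] = &runqueues[cpu];
    }
}

/* Look up the runqueue a logical CPU schedules from. */
struct runqueue *cpu_rq(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return cpu_rq_map[cpu];
}

/* Map cpu2 into cpu1's runqueue, so that both schedule from one queue. */
void sched_map_runqueue(int cpu1, int cpu2)
{
    struct runqueue *rq1 = cpu_rq(cpu1);
    struct runqueue *rq2 = cpu_rq(cpu2);

    if (rq1 == rq2)
        return;   /* already sharing */

    /* Sketch assumption: called at boot, while cpu2's queue is still
     * empty and private, so there is nothing to migrate. */
    assert(rq2->nr_running == 0 && rq2->nr_cpus == 1);

    rq2->nr_cpus = 0;
    rq1->nr_cpus++;
    cpu_rq_map[cpu2] = rq1;
}
```

In this sketch, lowlevel boot code for a 2/package box would call sched_map_runqueue(0, 1) and sched_map_runqueue(2, 3) after sched_init_sketch(). A task queued through either sibling is then visible to both, which is how a single shared queue can cover task pickup, affinity and wakeup per physical CPU.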
(NUMA systems which have tightly coupled CPUs with a smaller cache and
protected by a large L3 cache might benefit from sharing the runqueue as
well - but the target for this concept is SMT.)
compiling a standalone floppy.c in an infinite loop takes 2.55 seconds per
iteration. Starting up two such loops in parallel, on a 2-physical,
2-logical (total of 4 logical CPUs) P4 HT box gives the following numbers:
2.5.31-BK-curr: - fluctuates between 2.60 secs and 4.6 seconds.
BK-curr + sched-F3: - stable 2.60 sec results.
the results under the stock scheduler depend on pure luck: which CPUs get
the tasks scheduled. In the HT-aware case each task gets scheduled on a
separate physical CPU, all the time.
compiling the kernel source via "make -j2" [under-utilizes CPUs]:
2.5.31-BK-curr: 45.3 sec
BK-curr + sched-F3: 41.3 sec
ie. a ~10% improvement. The tests were the best results picked from lots
of (>10) runs. The no-HT numbers fluctuate much more (again the randomness
effect), so the average compilation time in the no-HT case is higher.
saturated compilation "make -j5" results are roughly equivalent, as
expected - the one-runqueue-per-CPU concept works adequately when the
number of tasks is larger than the number of logical CPUs. The stock
scheduler works well on HT boxes in the boundary conditions: when there's
1 task running, and when there are more than nr_cpus tasks running.
the patch also unifies some of the other code and removes a few more
#ifdef CONFIG_SMP branches from the scheduler proper.
(the patch compiles/boots/works just fine on UP and SMP as well, on the P4
box and on another PIII SMP box as well.)
Testreports, comments, suggestions welcome,