Linux: HyperThreading-Aware Scheduler

Submitted by Jeremy
on August 28, 2002 - 12:59pm

Ingo Molnar, author of the O(1) scheduler [earlier story] and the orginal preemptive kernel patch, has provided a patch to make the O(1) scheduler fully aware of HyperThreading. Ingo explains:

"Symmetric multithreading (hyperthreading) is an interesting new concept that IMO deserves full scheduler support. Physical CPUs can have multiple (typically 2) logical CPUs embedded, and can run multiple tasks 'in parallel' by utilizing fast hardware-based context-switching between the two register sets upon things like cache-misses or special instructions. To the OSs the logical CPUs are almost undistinguishable from physical CPUs. In fact the current scheduler treats each logical CPU as a separate physical CPU - which works but does not maximize multiprocessing performance on SMT/HT boxes."

Read on for Ingo's full explanation.


From: Ingo Molnar
To: linux-kernel
Subject: [patch] "fully HT-aware scheduler" support, 2.5.31-BK-curr
Date: 	Tue, 27 Aug 2002 03:44:23 +0200 (CEST)

symmetric multithreading (hyperthreading) is an interesting new concept
that IMO deserves full scheduler support. Physical CPUs can have multiple
(typically 2) logical CPUs embedded, and can run multiple tasks 'in
parallel' by utilizing fast hardware-based context-switching between the
two register sets upon things like cache-misses or special instructions.
To the OSs the logical CPUs are almost undistinguishable from physical
CPUs. In fact the current scheduler treats each logical CPU as a separate
physical CPU - which works but does not maximize multiprocessing
performance on SMT/HT boxes.

The following properties have to be provided by a scheduler that wants to
be 'fully HT-aware':

  • HT-aware passive load-balancing: the irq-driven balancing has to be per-physical-CPU, not per-logical-CPU. Otherwise it might happen that one physical CPU runs 2 tasks, while another physical CPU runs no threads. The stock scheduler does not recognize this condition as 'imbalance' - to the scheduler it appears as if the first two CPUs had 1-1 task running, the second two CPUs had 0-0 tasks running. The stock scheduler does not realize that the two logical CPUs belong to the same physical CPU.
  • 'active' load-balancing when a logical CPU goes idle and thus causes a physical CPU imbalance. This is a mechanism that simply does not exist in the stock 1:1 scheduler - the imbalance caused by an idle CPU can be solved via the normal load-balancer. In the HT case the situation is special because the source physical CPU might have just two tasks running, both runnable - this is a situation that the stock load-balancer is unable to handle - running tasks are hard to be migrated away. But it's essential to do this - otherwise a physical CPU can get stuck running 2 tasks, while another physical CPU stays idle.
  • HT-aware task pickup. When the scheduler picks a new task, it should prefer all tasks that share the same physical CPU - before trying to pull in tasks from other CPUs. The stock scheduler only picked tasks that were scheduled to that particular logical CPU.
  • HT-aware affinity. Tasks should attempt to 'stick' to physical CPUs, not logical CPUs.
  • HT-aware wakeup. again this is something completely new - the stock scheduler only knows about the 'current' CPU, it does not know about any sibling [== logical CPUs on the same physical CPU] logical CPUs. On HT, if a thread is woken up on a logical CPU that is already executing a task, and if a sibling CPU is idle, then the sibling CPU has to be woken up and has to execute the newly woken up task immediately.
the attached patch (against 2.5.31-BK-curr) implements all the above HT-scheduling needs by introducing the concept of a shared runqueue: multiple CPUs can share the same runqueue. A shared, per-physical-CPU runqueue magically fulfills all the above HT-scheduling needs. Obviously this complicates scheduling and load-balancing somewhat (see the patch for details), so great care has been taken to not impact the non-HT schedulers (SMP, UP). In fact the SMP scheduler is a compile-time special case of the HT scheduler. (and the UP scheduler is a compile-time special case of the SMP scheduler) the patch is based on Jun Nakajima's prototyping work - the lowlevel x86/Intel bits are still those from Jun, the sched.c bits are newly implemented and generalized. There's a single flexible interface for lowlevel boot code to set up physical CPUs: sched_map_runqueue(cpu1, cpu2) maps cpu2 into cpu1's runqueue. The patch also implements the lowlevel bits for P4 HT boxes for the 2/package case. (NUMA systems which have tightly coupled CPUs with a smaller cache and protected by a large L3 cache might benefit from sharing the runqueue as well - but the target for this concept is SMT.) some numbers: compiling a standalone floppy.c in an infinite loop takes 2.55 seconds per iteration. Starting up two such loops in parallel, on a 2-physical, 2-logical (total of 4 logical CPUs) P4 HT box gives the following numbers:
  2.5.31-BK-curr:     - fluctuates between 2.60 secs and 4.6 seconds.

BK-curr + sched-F3: - stable 2.60 sec results.

the results under the stock scheduler depends on pure luck: which CPUs get
the tasks scheduled. In the HT-aware case each task gets scheduled on a
separate physical CPU, all the time.

compiling the kernel source via "make -j2" [under-utilizes CPUs]:

  2.5.31-BK-curr:              45.3 sec

BK-curr + sched-F3: 41.3 sec

ie. a ~10% improvement. The tests were the best results picked from lots
of (>10) runs. The no-HT numbers fluctuate much more (again the randomness
effect), so the average compilation time in the no-HT case is higher.

saturated compilation "make -j5" results are roughly equivalent, as
expected - the one-runqueue-per-CPU concept works adequately when the
number of tasks is larger than the number of logical CPUs. The stock
scheduler works well on HT boxes in the boundary conditions: when there's
1 task running, and when there's more nr_cpus tasks running.

the patch also unifies some of the other code and removes a few more
#ifdef CONFIG_SMP branches from the scheduler proper.

(the patch compiles/boots/works just fine on UP and SMP as well, on the P4
box and on another PIII SMP box as well.)

Testreports, comments, suggestions welcome,

Ingo

[patch]


Related Links:

which CPUs?

on
August 28, 2002 - 3:24pm

Which CPUs have this capability?

If one has a one-cpu system with this capability, does one have to compile for SMP to get the advantage of HT under linux? before this patch? after this patch?

re: which CPUs?

on
August 28, 2002 - 3:44pm

I'm only aware of Intel Xeon's with this ability. According to this page:

"Hyper-Threading Technology is a groundbreaking innovation from Intel

re: which CPUs?

Anonymous
on
August 28, 2002 - 4:33pm

Anyone else know of other simultaneous multi-threading capable pr
ocessors?

SPARC is heading this way too I believe. If not there already.

Re: Which CPUs?

on
August 28, 2002 - 5:00pm

It should be noted, to avoid confusion, that "Intel Xeon" refers to the Pentium4 Xeon, and not any of the previous Xeons. (many people get confused by this)

Also, currently the kernel sees _2_ CPUs in each CPU.

And you are right, you do need to compile SMP support to take advantage of HT.

--
I like short comments

tera threaded processor

on
August 28, 2002 - 6:16pm

Is the hyperthreading on the Xeon similar to Tera's (now branded as Cray) MTA (multithreaded architecture) system?

Re: tera threaded processor

Anonymous
on
August 30, 2002 - 7:10pm

It's similar only on the most basic level, in the sense that both involve more than one hardware context in the CPU. Of course, the Tera MTA has 128 hardware contexts per CPU, not 2, and a lot of extra hardware for handling the distributed memory architecture.

re: which CPUs?

on
August 28, 2002 - 7:58pm

Anyone else know of other simultaneous multi-threading capable processors?

After reading up on it, I think its very sexy. I want one. Actually two. Yes, two. :)

BTW, I think that ia64 might also be described as SMT. However, I don't know how that effects its emulation modes.

IA-64 does not feature SMT.

Anonymous
on
August 28, 2002 - 10:26pm

IA-64 does not feature SMT.

SMT != VLIW && SMT != EPIC && fabs(EPIC-VLIW) < epsilon

Anonymous
on
September 5, 2002 - 8:50pm

SMT gives you parallelism across tasks, by having the CPU present itself to the OS and the programmer as multiple CPUs. VLIW / EPIC gives you parallelism within a single sequence of code, by allowing you to schedule multiple instructions that may be issued in parallel to multiple functional units. (The alternative path to instruction-level parallelism within a task is superscalar issue. This requires scheduling hardware in the CPU. Most current desktop/server CPUs, except IA64, are of this type.) Multithreading and within-thread instruction-level parallelism are completely different paths to parallelism. The two concepts are orthogonal.

On a superscalar machine, one way you can get SMT behavior by tracking "virtual CPU number" alongside the register number in the register renaming logic. You also have to carry this fiction through to the memory system, by appending the virtual CPU info to addresses, so that the memory protection and privelege checks all work properly. This allows all SMT threads to make progress on a fine-grain level, with one task filling the other tasks' gaps.

On a VLIW, the most likely way you'd get SMT is to just interleave instruction packets. VLIWs lack instruction scheduling hardware, so the only way you can get any sort of multithreading is to do so in a fairly predictable clockwork manner. Traditional VLIWs have exposed pipelines. It so happens that SMT on VLIW provides a path to deeper pipelines: if you double your pipeline depth, you can hide it by interleaving two threads, exposing a virtual pipeline that's the same length as your original pipeline length.

On EPIC, the story is similar to VLIW, although EPIC supports a protected pipeline. Thus, you get much more flexibility on when you switch between threads.

On either EPIC or VLIW, you will still need to tag register numbers with virtual CPU (so results go to the correct register file) and likewise with addresses (so protection/privelege checks happen correctly).

--Joe

Its not that useful

on
August 28, 2002 - 10:29pm

Tom's Hardware did some testing with the HT processors a while ago. Their conclusion a Hyperthreading is not that useful:

A few final thoughts on Intel's Hyperthreading technology, which virtually doubles the number of processors: when using typical applications that are optimized for dual processing, Hyperthreading brings no advantages with it. Rather, the overhead on data slows down the application. Only software that is specially adapted for Hyperthreading enables an increase in performance. In addition, when Hyperthreading is activated, the memory performance decreases drastically, which is partially reflected by the memory benchmark.

http://www17.tomshardware.com/cpu/02q1/0203131/dual-10.html

So until gcc supports hyperthreading, don't waste your money on a Xeon.

--Nate

Re: It's Not that Useful

on
August 28, 2002 - 11:21pm

True, it's not very useful (at least not yet; and it may never be). Another problem is that you need to turn SMP on to use HT, which creates extra overhead (i.e., more complex locking).

I also spotted the following comment sometime ago in lkml, not sure if it's true, but it makes sense somewhat:

"hyperthreading will give you some performance boostes, but *only*
if you have many runable processes a majority of the time, or very heavily
threaded applications running on the system. (an example would
be running 4 setiathome clients on a dual processor system).
"

--
I like short comments

Re: It's Not that Useful

on
August 29, 2002 - 8:06am

...and that would be as opposed to all the people running dual Pentium 4 Xeons on a workstation that surfs the web and runs a mail client? ;-)

On modern systems, I'd expect at least some benefit

Anonymous
on
August 29, 2002 - 8:43am

On modern systems, I'd expect at least some benefit. In Ingo's example, he was compiling floppy.c repeatedly, which took 2.6 seconds per iteration. When two iterations ended up executing on one CPU, the total time for the two iterations went up to 4.6 seconds. That's still a a ~10% speedup, as two iterations in serial would take 5.2 seconds.

Why do I say modern systems would benefit, then? Typically, we have multiple processes come ready in small bursts. Consider scrolling a webpage or resizing a window: The X server becomes runnable, the web browser becomes runnable, and maybe even the window manager becomes runnable all at once. If we assume a 10% lift from SMT, then the benefit of SMT over just a single-processor setup would be a potential 10% drop in latency, which is a very good thing.

You mention SMP locking as an overall performance drawback. You're right, and I believe that tends to hit I/O intensive benchmarks and batch-oriented heavy compute jobs. It's also a couple-percent difference that only really shows up when you have many syscalls or a large number of tasks. It just so happens that the "large number of tasks" case is the one SMT helps, and I'd hope helps more than the overhead of the additional locking. If you're making many syscalls, you have other problems. :-)

You may think it's not that useful....

Anonymous
on
September 1, 2002 - 4:08pm

But on a 2.4.19 kernel without SMP support, compiling itself takes about 2 mins. On a 2.4.19 kernel with SMP support, compiling itself takes about 45 seconds. I'd say it's pretty damn useful.

I've tested this several times, to be sure of what I was seeing. And the numbers are consistent within 2 seconds over 10 reboots.

Try actually using one of these systems first...before believing everything you read.

*that goes for this post too*

Re: You may think it's not that useful....

on
September 1, 2002 - 5:28pm

That's because it's one of those cases where you have many CPU-intensive processes running simultaneously. You might even see a bigger gain by increasing the number of simultaneous make processes (e.g., make -j 1000 bzImage).

One thing i'm not clear about: were these test run on a single-processor Xeon box, or a real SMP box?

Try to compile it on several

Anonymous
on
September 2, 2002 - 9:55am

Try to compile it on several processes on one processor
and you will not have a 100%+ increase in speed.

You have to benchmark things equaly, or you can get any figures you want.

Re: Its not that useful

Anonymous
on
September 5, 2002 - 7:58am

I have to object.

Ingo did to a compilition to show the effect, didn't he?

Also people with CPU-constrained servers experiencing contiouus load >2
on a dual CPU machine would be happy.

Please remember that not all Servers are memory performance constrained.

Christian

re:which CPUs?

on
August 30, 2002 - 10:56pm

I would think the IBM Power4, with actual multiple cores, would be a good candidate?

A: That's not SMT. B: Th

Anonymous
on
September 5, 2002 - 8:55pm

A: That's not SMT. B: The extra CPU is an error-check CPU only.

HyperThreading-Aware Scheduler - Why ?

Anonymous
on
August 31, 2002 - 1:30am

Citation of Ingo Molnar:
"Otherwise it might happen that one physical CPU runs 2 tasks, while another physical CPU runs no threads. The stock scheduler does not recognize this condition as 'imbalance' - to the scheduler it appears as if the first two CPUs had 1-1 task running, the second two CPUs had 0-0 tasks running. The stock scheduler does not realize that the two logical CPUs belong to the same physical CPU." End of Citation.

In essential, why is this bad ? I have no special competence in this area so I suppose that the reason is that the two running threads affects each others performance more than if they were at separate physical CPUs.
   But shouldn't the solution be more general ? Consider the situation in (not neccessarely far IMO) future that "HT-CPUs" get rid of most of the performance differences (compared with separate physical CPUs), or that other physical solutions make CPUs affect other CPUs in different amounts.
   An example could be where several physical CPUs share the same "chip"/dye, where each has (more or less) each own connection with the "bus/es". And that the motherboard has two or more of these "chips". The problems could be the effects of sharing memory buses, of peripheral buses as the PCI bus, or whatever. (E g each physical chip could have their own peripheral bus (in any sense), etc.)

Just a thought. (And not neccessarely thinking of Beowulf clusters.. ;-))

Thomas Berg

Why ...

Anonymous
on
August 31, 2002 - 11:57pm

Because when there are two processes running on the same physical CPU (1 per logical CPU), the physical CPU still has to share it's resources between the two processes (not sure which resource).

acd2

Why ...

Anonymous
on
September 4, 2002 - 7:29am

I understand that, but my question was: why not use a more genereal approach, as I described.

Thomas

Efficiency

Anonymous
on
September 5, 2002 - 4:20am

The scheduler is called every context switch; the less you have to think about in the scheduler, the fewer instructions involved. Fewer instructions means less time rescheduling, and more time running "useful" code.


Basically it's a tradeoff; a more complex scheduler is more likely to make the "right" decision, but takes longer to do it. On current hardware, it usually works out that a quicker scheduler that isn't as good is more effective (in terms of the amount of work that gets done) than a slower scheduler that gets things right.

Resource sharing

Anonymous
on
December 14, 2003 - 6:44am

What I understand from an article I once read, HT works by announcing two processors, while it actually is one processor. HT isn't real multi-processoring:
Processor 1 runs at a normal speed like it would do at UP time.
'Processor' 2 runs only when it's calculation requires different
caculational units than the other processor is using.

eg.:
Processor 1 is doing floating calculations, at that moment the integer calculation part of the processor is idle, so if at that moment the process on the second processor wants to do some integer calculation than that's possible in parallel of the first process.

But there is a problem when both processes are doing almost identical things, in that case HT doesn't perform very well. So if you have the choice of distributing processes across physical processors than this should have precedence above logical processors so the second process isn't dependant upon the needs of the first process.

(Okay this sucks, hopefully you understand ;)

nuh-uh

Anonymous
on
June 29, 2004 - 8:50am

I don't think that's what's going on.

THe 2 logical processors are sharing the same text segment, meaning that 2 *threads* of a process can run in parrallel. In other words, sendmail, on a HT aware scheduler, can run 2 of it's threads on one physical CPU and get approx. 2x performance.

...On the other hand, if 2 separate processes are scheduled to the same physical CPU, this would be slower than an SMP system, because the scheduler would have to context switch between them, effectively halting the other process. Better that they be assigned to separate physical CPU's so they both run simultaniously.

...Not sure if I explained that clearly, but mail to tewner@jct.ac.il if you:
1. want more info.
2. KNOW I'm wrong.

http://lwn.net/Articles/80911/ explains it all

Same Anonymous (not verified)
on
December 29, 2004 - 4:57am

Check http://lwn.net/Articles/80911/ for an explanation of what's really going on.

HT aware scheduler

Anonymous
on
July 20, 2004 - 1:49pm

Hello,
I am using a Xeon cluster with linux 2.6 which is HT aware. But surprisingly I found that two of my processes (MPI) are scheduled in one processor (two logical CPU with hyperthreading) while the other physical CPU is idle. Apparently they are not maintaining per physical processor process queue, and can not differentiate between logical and phycial processor. I wonder if anyone can help me to find out about the linux 2.6 scheduler.

The following is the output of my simple mpi code with some computation with 10 iteration, I created two processes and used a machine file defining one node with two processor. There was no communication between processes and two of my processes took 99.9% of CPU (possibly logical)
I was expecting almost same amount of time for each iteration, but I found different result.

while top in enode9 found to be
16:07:04 up 44 days, 10:58, 1 user, load average: 1.94, 1.56, 0.91
70 processes: 67 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 50.0% user, 0.1% system, 0.0% nice, 49.9% idle
Mem: 515820K total, 291316K used, 224504K free, 63732K buffers
Swap: 2000052K total, 0K used, 2000052K free, 133196K cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
15361 arefeen 25 0 5168 2720 4532 R 99.9 0.5 3:04 2po
15364 arefeen 25 0 5168 2708 4532 R 99.9 0.5 3:04 2po
15393 arefeen 16 0 2052 1052 1840 R 0.3 0.2 0:00 top
419 root 16 0 1624 640 1460 S 0.1 0.1 0:35 automount
1 root 16 0 1520 516 1368 S 0.0 0.1 0:05 init
2 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration/0
3 root 34 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd/0
4 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration/1
5 root 34 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd/1
6 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration/2
7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd/2
8 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration/3
9 root 34 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd/3
10 root 5 -10 0 0 0 SW< 0.0 0.0 0:00 events/0

arefeen@emaster:~/test1$ /opt/mpi/bin/mpirun -np 2 -machinefile 2p 2po
I am process 0 of job1 and time= 20.927634
I am process 1 of job1 and time= 21.173508
I am process 0 of job1 and time= 40.892489
I am process 1 of job1 and time= 41.205131
I am process 0 of job1 and time= 42.390944
I am process 1 of job1 and time= 42.380065
I am process 0 of job1 and time= 42.005933
I am process 1 of job1 and time= 41.755328
I am process 0 of job1 and time= 23.445549
I am process 1 of job1 and time= 23.238774
I am process 1 of job1 and time= 21.482233
I am process 0 of job1 and time= 21.715304
I am process 1 of job1 and time= 21.543872
I am process 0 of job1 and time= 21.771917
I am process 1 of job1 and time= 21.625980
I am process 0 of job1 and time= 21.848045
I am process 1 of job1 and time= 40.789668
I am process 0 of job1 and time= 41.502472
I am process 1 of job1 and time= 42.381291
I am process 1 of job1 and average time= 31.757585
I am process 0 of job1 and time= 41.855390
I am process 0 of job1 and average time= 31.835568
arefeen@emaster:~/test1$

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.