Linux: Improving Multiprocessor CPU Scheduling

Submitted by Jeremy on February 28, 2005 - 6:21am

Nick Piggin [interview] posted a series of 13 patches for the 2.6 Linux kernel [forum] CPU scheduler aimed at improving multiprocessor support. Specifically, the patches focus on improving SMT (Simultaneous MultiThreading) [story], CMP (Chip MultiProcessing), and NUMA (Non-Uniform Memory Access) [story] scheduling behavior.

The patches are still in development, and Nick hopes to first get them merged into Andrew Morton [interview]'s -mm tree [story] for testing: "they are not going to be very well tuned for most usages at the moment (unfortunately dbt2/3-pgsql on OSDL isn't working, which is a good one). So hopefully I can address regressions as they come up." Ingo Molnar [interview], author of the 2.6 CPU scheduler, reviewed the patches and divided them into three groups: fixes and low-risk improvements that can go into Linus' main -bk tree right after 2.6.11 is released, a second set he also suggests for early BK integration, and a third set of more radical changes that he recommends testing in the -mm tree first.


From: Nick Piggin [email blocked]
To: Andrew Morton [email blocked]
Subject: [PATCH 0/13] Multiprocessor CPU scheduler patches
Date: 	Thu, 24 Feb 2005 18:14:53 +1100

Hi,

I hope that you can include the following set of CPU scheduler
patches in -mm soon, if you have no other significant performance
work going on.

There are some fairly significant changes, with a few basic aims:
* Improve SMT behaviour
* Improve CMP behaviour, CMP/NUMA scheduling (ie. Opteron)
* Reduce task movement, esp over NUMA nodes.

They are not going to be very well tuned for most usages at the
moment (unfortunately dbt2/3-pgsql on OSDL isn't working, which
is a good one). So hopefully I can address regressions as they
come up.

There are a few problems with the scheduler currently:

Problem #1:
It has _very_ aggressive idle CPU pulling. Not only does it not
really obey imbalances, it is also wrong for eg. an SMT CPU
who's sibling is not idle. The reason this was done really is to
bring down idle time on some workloads (dbt2-pgsql, other
database stuff).

So I address this in the following ways; reduce special casing
for idle balancing, revert some of the recent moves toward even
more aggressive balancing.

Then provide a range of averaging levels for CPU "load averages",
and we choose which to use in which situation on a sched-domain
basis. This allows idle balancing to use a more instantaneous value
for calculating load, so idle CPUs need not wait many timer ticks
for the load averages to catch up. This can hopefully solve our
idle time problems.
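
[Editorial sketch, not part of Nick's message. The idea of keeping several averaging levels per CPU and picking one per balancing situation can be illustrated with a toy model in Python. It assumes an exponential moving average with power-of-two weights, which the message does not specify; `CpuLoad`, `tick` and `DOMAIN_LEVELS` are hypothetical names, not kernel identifiers.]

```python
class CpuLoad:
    """Toy per-CPU load tracker with several averaging levels (hypothetical)."""

    def __init__(self, levels=3):
        # Level 0 follows the current load exactly; higher levels are
        # smoothed over progressively more timer ticks.
        self.avg = [0.0] * levels

    def tick(self, current_load):
        """Fold the load observed on this timer tick into every level."""
        for i in range(len(self.avg)):
            scale = 1 << i  # weights 1, 2, 4, ... (an assumption)
            self.avg[i] = (self.avg[i] * (scale - 1) + current_load) / scale

    def load(self, level):
        """Return the load estimate at the requested averaging level."""
        return self.avg[level]


# Hypothetical per-domain setting: which level each kind of balancing reads.
DOMAIN_LEVELS = {"idle": 0, "busy": 2}

cpu = CpuLoad()
for _ in range(2):
    cpu.tick(100)  # the CPU has just become busy

idle_view = cpu.load(DOMAIN_LEVELS["idle"])
busy_view = cpu.load(DOMAIN_LEVELS["busy"])
```

[In this model, after two ticks at load 100 the instantaneous level already reports 100 while the most heavily smoothed level reports 43.75, so an idle CPU reading the instantaneous level sees the imbalance without waiting for the slower averages to catch up.]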

Also, further moderate "affine wakeups", which can tend to move
most tasks to one CPU on some workloads and cause idle problems.

Problem #2:
The second problem is that balance-on-exec is not sched-domains
aware. This means it will tend to (for example) fill up two cores
of a CPU on one socket, then fill up two cores on the next socket,
etc. What we want is to try to spread load evenly across memory
controllers.

So make that sched-domains aware following the same pattern as
find_busiest_group / find_busiest_queue.
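
[Editorial sketch, not part of Nick's message. The two-step pattern described above (choose a group first, then a CPU inside it) can be modelled in a few lines of Python. The topology, the load metric (a plain task count) and the function names `idlest_group`, `idlest_cpu` and `place_task` are assumptions made for illustration; the patch itself is not reproduced here.]

```python
def idlest_group(groups, load):
    """Return the group (for example one socket) with the lowest total load."""
    return min(groups, key=lambda cpus: sum(load[c] for c in cpus))


def idlest_cpu(cpus, load):
    """Return the least loaded CPU inside a single group."""
    return min(cpus, key=lambda c: load[c])


def place_task(groups, load):
    """Pick a CPU for a newly exec'd task: group first, then CPU."""
    cpu = idlest_cpu(idlest_group(groups, load), load)
    load[cpu] += 1  # account for the task we just placed
    return cpu


# Two dual-core sockets: CPUs 0-1 share one memory controller, CPUs 2-3 the other.
sockets = [(0, 1), (2, 3)]
load = {0: 0, 1: 0, 2: 0, 3: 0}
placement = [place_task(sockets, load) for _ in range(4)]
```

[With this toy topology the four tasks are placed on CPUs 0, 2, 1 and 3 in that order, so the first two tasks land on different sockets instead of filling both cores of one socket first.]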

Problem #3:
Lastly, implement balance-on-fork/clone again. I have come to the
realisation that for NUMA, this is probably the best solution.
Run-cloned-child-last has run out of steam on CMP systems. What
it was supposed to do was provide a period where the child could
be pulled to another CPU before it starts running and allocating
memory. Unfortunately on CMP systems, this tends to just be to the
other sibling.

Also, having such a difference between thread and process creation
was not really ideal, so we balance on all types of fork/clone.
This really helps some things (like STREAM) on CMP Opterons, but
also hurts others, so naturally it is settable per-domain.

Problem #4:
Sched domains isn't very useful to me in its current form. Bring
it up to date with what I've been using. I don't think anyone other
than myself uses it so that should be OK.

Nick

[patches]


From: Ingo Molnar [email blocked]
Subject: Re: [PATCH 1/13] timestamp fixes
Date: Thu, 24 Feb 2005 09:34:41 +0100

* Nick Piggin [email blocked] wrote:

> On Thu, 2005-02-24 at 08:46 +0100, Ingo Molnar wrote:
> > * Nick Piggin [email blocked] wrote:
> >
> > > 1/13
> > >
> >
> > ugh, has this been tested? It needs the patch below.
> >
>
> Yes. Which might also explain why I didn't see -ve intervals :( Thanks
> Ingo.
>
> In the context of the whole patchset, testing has mainly been based
> around multiprocessor behaviour so this doesn't invalidate that.

nono, by 'this' i only meant that patch. The other ones look mainly OK,
but obviously they need a _ton_ of testing.

these:

 [PATCH 1/13] timestamp fixes
   (+fix)
 [PATCH 2/13] improve pinned task handling
 [PATCH 3/13] rework schedstats

can go into BK right after 2.6.11 is released as they are fixes or
norisk-improvements. [lets call them 'group A']

These three:

 [PATCH 4/13] find_busiest_group fixlets
 [PATCH 5/13] find_busiest_group cleanup
 [PATCH 7/13] better active balancing heuristic

look pretty fine too and i'd suggest early BK integration too - but in
theory they could impact things negatively so that's where immediate BK
integration has to stop in the first phase, to get some feedback. [lets
call them 'group B']

these:

 [PATCH 6/13] no aggressive idle balancing
 [PATCH 8/13] generalised CPU load averaging
 [PATCH 9/13] less affine wakups
 [PATCH 10/13] remove aggressive idle balancing
 [PATCH 11/13] sched-domains aware balance-on-fork
 [PATCH 12/13] schedstats additions for sched-balance-fork
 [PATCH 13/13] basic tuning

change things radically, and i'm uneasy about them even in the 2.6.12
timeframe. [lets call them 'group C'] I'd suggest we give them a go in
-mm and see how things go, so all of them get:

 Acked-by: Ingo Molnar [email blocked]

If things dont stabilize quickly then we need to do it piecemail wise.
The only possible natural split seems to be to go for the running-task
balancing changes first:

 [PATCH 6/13] no aggressive idle balancing
 [PATCH 8/13] generalised CPU load averaging
 [PATCH 9/13] less affine wakups
 [PATCH 10/13] remove aggressive idle balancing
 [PATCH 13/13] basic tuning

perhaps #8 and relevant portions of #13 could be moved from group C into
group B and thus hit BK early, but that would need remerging.

and then for the fork/clone-balancing changes:

 [PATCH 11/13] sched-domains aware balance-on-fork
 [PATCH 12/13] schedstats additions for sched-balance-fork

a more finegrained splitup doesnt make much sense, as these groups are
pretty compact conceptually.

But i expect fork/clone balancing to be almost certainly a problem. (We
didnt get it right for all workloads in 2.6.7, and i think it cannot be
gotten right currently either, without userspace API help - but i'd be
happy to be proven wrong.)

(if you agree with my generic analysis then when you regenerate your
patches next time please reorder them according to the flow above, and
please try to insert future fixlets not end-of-stream but according to
the conceptual grouping.)

	Ingo


already in mm

Anonymous (not verified) on March 1, 2005 - 2:59am

It has been merged in 2.6.11-rc5-mm1

Now if only 2.6.12 was out with the new pcmcia-stuff :-)

Isn't CMP short for chip mult

Anonymous (not verified) on March 1, 2005 - 7:10am

Isn't CMP short for chip multiprocessing, instead of "cellular multiprocessing" (whatever that means)? I.e. true multi-core chips and such instead of simultaneous multithreading, where execution and bus resources are shared.

re: Isn't CMP short

Jeremy on March 1, 2005 - 7:20am

Yes, I suppose that makes sense.

FWIW: Chip multiprocessing versus cellular multiprocessing.
