From: Paul Jackson <pj@sgi.com>
Add a new per-cpuset flag called 'sched_load_balance'.

When enabled in a cpuset (the default value) it tells the kernel
scheduler that the scheduler should provide the normal load
balancing on the CPUs in that cpuset, sometimes moving tasks
from one CPU to a second CPU if the second CPU is less loaded
and if that task is allowed to run there.

When disabled (write "0" to the file) then it tells the kernel
scheduler that load balancing is not required for the CPUs in
that cpuset.

Now even if this flag is disabled for some cpuset, the kernel
may still have to load balance some or all the CPUs in that
cpuset, if some overlapping cpuset has its sched_load_balance
flag enabled.

If there are some CPUs that are not in any cpuset whose
sched_load_balance flag is enabled, the kernel scheduler will
not load balance tasks to those CPUs.

Moreover the kernel will partition the 'sched domains'
(non-overlapping sets of CPUs over which load balancing is
attempted) into the finest granularity partition that it can
find, while still keeping any two CPUs that are in the same
sched_load_balance enabled cpuset in the same element of the
partition.

This serves two purposes:
1) It provides a mechanism for real time isolation of some CPUs, and
2) it can be used to improve performance on systems with many CPUs
by supporting configurations in which load balancing is not done
across all CPUs at once, but rather only done in several smaller
disjoint sets of CPUs.

This mechanism replaces the earlier overloading of the per-cpuset
flag 'cpu_exclusive', which overloading was removed in an earlier
patch: cpuset-remove-sched-domain-hooks-from-cpusets

See further the Documentation and comments in the code itself.
Acked-by: Paul Jackson <pj@sgi.com>
---
Andrew - this patch goes right after your *-mm patch:
task-containers-enable-containers-by-default-in-some-configs.patch
and before "add-containerstats-v3.patch"

Documentation/c...
That's not kernel style. Use either (Andrew would say the second one):
q = csa = doms = NULL;
or
q = NULL;
csa = NULL;
---
~Randy
-
Yup - I should have written this line as:
It makes no difference to the code generated. I tend to leave
out 'compiler optimization' hint words if I don't need them to
get the compiler to optimize. In this case, of a single use

You're right - and Andrew would be right as well, since the form:
q = csa = doms = NULL;
generates a compiler warning, as not all three pointers are the
same type.

Yup - you're right - about the 'csa' check.
However the if(q ...) check is needed, because I have another bug
here. I allocated 'q' using kfifo_alloc(), so must free using
kfifo_free (or else leak the kfifo buffer memory.) Calls to
kfifo_free() have to guard against NULL pointers before the call.

Thanks, Randy!
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
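The point about guarding kfifo_free() can be illustrated with a small user-space sketch. Everything here is a hypothetical stand-in, not kernel code: struct fifo, fifo_alloc(), fifo_free() and cleanup() are invented names that only mimic the shape of the situation described above (an object with a separately allocated buffer, and a free routine that dereferences its argument). It also assumes the 'csa' check being discussed was a NULL test before freeing.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kfifo: a header plus a separate buffer. */
struct fifo {
    unsigned char *buffer;
    size_t size;
};

struct fifo *fifo_alloc(size_t size)
{
    struct fifo *q;

    assert(size > 0);
    q = malloc(sizeof(*q));
    if (!q)
        return NULL;
    q->buffer = malloc(size);
    if (!q->buffer) {
        free(q);
        return NULL;
    }
    q->size = size;
    return q;
}

/* Dereferences q, so the caller must never pass NULL. */
void fifo_free(struct fifo *q)
{
    free(q->buffer);    /* calling only free(q) would leak this buffer */
    free(q);
}

/* Error-path cleanup; returns how many objects were actually released. */
int cleanup(struct fifo *q, int *csa)
{
    int released = 0;

    if (q) {            /* guard required: fifo_free() is not NULL-safe */
        fifo_free(q);
        released++;
    }
    released += (csa != NULL);
    free(csa);          /* no guard required: free(NULL) is a no-op */
    return released;
}
```

The asymmetry is the whole point: a free routine that tolerates NULL needs no check at the call site, while one that dereferences its argument does.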
-
i like this, this feature would be quite useful for -rt and CPU
shielding.

( a cpuset is a mandatory container for set_cpus_allowed(), so there is
a material and app-visible difference between a 4-CPU cpuset that has
balancing disabled and 4x 1-CPU cpusets. )

Ingo
-
I don't like adding these funny special case sort of things like this.
The user should just be able to specify exactly the partitioning of
tasks required, and cpusets should ask the scheduler to do the best
job of load balancing possible.

I implemented that with my patches to do automatic discovery
of the largest set of disjoint cpusets.

From there, the problem that cpusets has, is that it lacks a good
way to specify that the machine should be partitioned (IIRC because
stuff defaults to going into the root cpuset which covers all CPUs?).

Instead of adding these "I want the scheduler to do something a bit
vague but hopefully good albeit with some downsides" flags, there
should be a way to say "I want to partition the CPUs like so...", IMO.

Barring that (ie. maybe you always want a root cpuset to cover all
CPUs), then maybe we should retain the spanning sched domains
in order to balance the root cpuset, and add another set of domains
according to cpuset partitioning. This could be entirely transparent
to userspace, I think (using my patch).

Just my opinion. Good to see more thought going into this area, because
it is something that sched-domains can do really well but is underused.
-
If the cpusets which have 'sched_load_balance' enabled are disjoint
(their 'cpus' cpus_allowed masks don't overlap) then you get exactly
what you're asking for. In that case there is exactly one sched domain
for the 'cpus' allowed by each cpuset that has sched_load_balance
enabled.

But there is another case in which one does not want what you ask for.
That case involves the situation where one is running a third party
batch scheduler on part of one's big system, and doing other stuff
(perhaps Ingo's realtime stuff) on another part of the system.

In that case, the system admin will be advised to turn off
sched_load_balance on the top cpuset. But in that case the system
admin will -not- know from moment to moment what jobs the batch
scheduler is running on the cpus assigned to its control. Only the
batch scheduler knows that.

The batch scheduler is code that was written by someone else, in
some other company, some other time. That code does not get to
control the overall sched domain partitioning of the entire system.
The batch scheduler gets to say, in effect:

Here's where I need load balancing to occur, in the normal fashion,
and here's where I don't need it.

In short, you are insisting that only a single administrative point of
control determine the system's sched domains. Sometimes that fits
the way the system is managed, and my patch lets you do that. But
sometimes this is a shared responsibility, between a piece of third
party software and the system admin, and my patch allows for that
case as well.

This is a typical sort of situation that arises from having hierarchical
cpuset definitions, and highlights the reason (and the use case,
involving third party batch schedulers) that I went with a hierarchical
cpuset architecture in the first place.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
But you could do that just by having the current cpuset scheme able
to properly partition the system. You can't (easily) do this now because
you have so many tasks in the root cpuset that it is impossible to know
whether or not you actually want to load balance them.

You would do this by creating partitioning cpusets which carve up the

In this case the admin would simply not partition the system (they
would retain a single root cpuset).

Neither approach is really fundamentally more or less powerful than
the other, but what I object to in yours is adding these flags which
don't allow the admin to specify what they want, but to specify how they
want it done.

Moreover, sched_load_balance doesn't really sound like a good name
for asking for a partition. It's more like you're just asking to have better
load balancing over that set, which you could equally achieve by adding
a second set of sched domains (and the global domains could keep
globally balancing).

Basically: the admin doesn't know best when it comes to how the
scheduler should work; the admin knows best about how they intend

Rather than require the admin to know the intricate details about
how and why the scheduler load balancing gets broken, and when they
might or might not need to use this flag, they can just specify what they

No, I'm insisting that *no* single administrative point of control
determines the sched domains. Not directly. The kernel should.
cpusets API should be rich enough that the kernel can derive this
information from what the admin has intended.
-
Hmmm ... this could be the key to this discussion.
Nick - can two sched domains overlap? And if they do, what does that
mean for any user or application behaviour?

From the cpuset side - this patch handles overlap by joining the 'cpus'
into one sched domain. If two cpusets with overlapping 'cpus' are both
marked 'sched_load_balance', then this patch forms a single, combined
sched domain.

As best as I can tell, you and I are actually in agreement in the
case that there is no overlap. If the several cpusets which have
'sched_load_balance' enabled have mutually disjoint 'cpus' (no
overlap), then my patch forms exactly one sched domain for each such
cpuset, having the same 'cpus'.

The issue is the overlapping cases - are overlapping sched domains
allowed, and if so, how do they affect user space?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
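The merging rule described in this message (overlapping load-balanced cpusets are joined into one sched domain, disjoint ones each get their own) can be sketched in plain user-space C. This is an illustration under stated assumptions, not the code from the patch: cpu_mask and build_partition() are hypothetical names, and a 64-bit integer stands in for the kernel's cpumask_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: bit n set means CPU n is a member. */
typedef uint64_t cpu_mask;

/*
 * Build the finest partition in which any two CPUs that share a
 * load-balanced cpuset end up in the same element.
 *
 * in[]  : 'cpus' masks of the cpusets with sched_load_balance enabled
 * out[] : receives the resulting domain masks (needs room for n entries)
 * Returns the number of domains written to out[].
 */
size_t build_partition(const cpu_mask *in, size_t n, cpu_mask *out)
{
    size_t ndoms = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        cpu_mask merged = in[i];
        size_t j = 0;

        if (merged == 0)
            continue;               /* an empty cpuset adds no domain */

        /* Absorb every existing domain that overlaps the new mask. */
        while (j < ndoms) {
            if (out[j] & merged) {
                merged |= out[j];
                out[j] = out[--ndoms];  /* drop it; recheck slot j */
            } else {
                j++;
            }
        }
        out[ndoms++] = merged;
    }
    return ndoms;
}
```

With input masks 0x03 (CPUs 0-1), 0x06 (CPUs 1-2) and 0x30 (CPUs 4-5), the first two overlap on CPU 1 and are joined, so the result is two domains: 0x07 and 0x30. CPU 3 is in no load-balanced cpuset and therefore lands in no domain.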
-
Yes, sched domains can be completely arbitrary, and of course in the
current kernel, parent domains always overlap their children.

A sched domain usually means that the scheduler can move tasks
around among that group of CPUs, given the correct flags (but if
there are no flags, then it would be a superfluous domain and should
get trimmed away I think).

BTW. as far as the sched.c changes in your patch go, I much prefer
the partition_sched_domains API: http://lkml.org/lkml/2006/10/19/85

The caller should manage everything itself, rather than

OK, I don't think your patch actually does the wrong thing
technically (although admittedly your rebuild_sched_domains

For hard partitions, you don't want them of course. And I think
we should come up with a cpusets solution for that first.

Afterwards, overlapping sched domains are allowed and could be
used to make balancing more efficient (rather than any real
effect on userspace). At the moment, the domain builder probably
wouldn't cope very well, though.
-
Please take a closer look at my partition_sched_domains() and its
interface to the scheduler.

You should recognize this API, once you look at it. It simply passes
the full flat, hard partition, in its entirety. This is the
partitioning that you speak of, I believe. It's here; just not where
you expected it.

The portion of the code that is in kernel/sched.c is just a little bit
of optimization. It avoids rebuilding all the sched domains and
reattaching every task to its sched domain; rather it determines which
sched domains were added or removed and just rebuilds them.

Once you take a closer look, I hope you will agree that this new
interface between the cpuset and sched code provides a cleaner
separation.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
I don't know what you think I said that is incorrect and requires me to
look at again. I don't like your partition_sched_domains API because
of the allocation thing. So I prefer the existing (or better, the simplified
version in my patch referenced).

The caller should determine which domain to rebuild and reattach.
Simple.
-
i've merged your patch to my scheduler queue - see the patch below. (And
could you send me your SoB line too?) Paul, if we went with the patch
below, what else would be needed for your purposes?

Ingo
--------------------------------->
Subject: sched: fix sched-domains partitioning by cpusets
From: Nick Piggin <nickpiggin@yahoo.com.au>

Fix sched-domains partitioning by cpusets. Walk the whole cpusets tree after
something interesting changes, and recreate all partitions.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/cpuset.h | 2
include/linux/sched.h | 3 -
kernel/cpuset.c | 109 ++++++++++++++++++++++---------------------------
kernel/sched.c | 31 +++++++------
4 files changed, 70 insertions(+), 75 deletions(-)

Index: linux/include/linux/cpuset.h
===================================================================
--- linux.orig/include/linux/cpuset.h
+++ linux/include/linux/cpuset.h
@@ -14,6 +14,8 @@

#ifdef CONFIG_CPUSETS
+extern int cpuset_hotplug_update_sched_domains(void);
+
extern int number_of_cpusets; /* How many cpusets are defined in system? */

extern int cpuset_init_early(void);
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -798,8 +798,7 @@ struct sched_domain {
#endif
};

-extern int partition_sched_domains(cpumask_t *partition1,
- cpumask_t *partition2);
+extern int partition_sched_domains(cpumask_t *partition);

#endif /* CONFIG_SMP */
Index: linux/kernel/cpuset.c
===================================================================
--- linux.orig/kernel/cpuset.c
+++ linux/kernel/cpuset.c
@@ -752,6 +752,24 @@ static int validate_change(const struct
return 0;
}

+static void update_cpu_domains_children(struct cpuset *par,
+ cpumask_t *non_partitioned)
+{
+ struct cpuset *c;
+
+ list_for_each_entry(c, &par->chil...
Nick and I already resolved that, when he first posted this patch
in October of 2006. The cpu_exclusive flag doesn't work for this.

Here's a copy of the key message, from Nick, near the end of that
thread in which he earlier proposed this patch, also available at:
http://lkml.org/lkml/2006/10/21/12

====================================================
Paul Jackson wrote:
> Nick wrote:
>
>>Or, another question, how does my patch hijack cpus_allowed? In
>>what way does it change the semantics of cpus_allowed?
>
>
> It limits load balancing for tasks in cpusets containing
> a superset of that cpusets cpus.
>
> There are always such cpusets - the top cpuset if no other.

Ah OK, and there is my misunderstanding with cpusets. From the
documentation it appears as though cpu_exclusive cpusets are
made in order to do the partitioning thing.

If you always have other domains overlapping them (regardless
that it is a parent), then what actual use does cpu_exclusive
flag have?
====================================================

I agree with Nick on this conclusion, and with his other conclusion
that the 'cpu_exclusive' flag is pretty near useless.

Some per-cpuset flag other than 'cpu_exclusive' is required to
control sched domains from cpusets.

This has specific impact on one of the key users of cpusets, the
various developers of batch schedulers. One by one, they have
determined that the cpu_exclusive flag is incompatible with the
way they set up cpusets, and have decided they should not enable
that flag on any cpuset under their control. It gets in their way,
and serves no useful purpose for them. However we need some way
for them to specify where they need load balancing, so that on
large systems, they can allow the admin to avoid the cost of load
balancing over the batch scheduler's entire subset of the system at
once, but rather just load balance over the small...
Sorry for the confusion: I only meant the sched.c part of that
patch, not the full thing.
-
Ah - ok. We're getting closer then. Good.
Let me be sure I've got this right then.
You prefer the interface from your proposed patch, by which the
cpuset code passes sched domain requests to the scheduler code as a single
cpumask that will define a sched domain:

int partition_sched_domains(cpumask_t *partition)
and I am suggesting instead a new and different interface:
void partition_sched_domains(int ndoms_new, cpumask_t *doms_new)
In the first API, one cpumask is passed in, and a single sched
domain is formed, taking those CPUs from any sched domain they
might have already been a member of, into this new sched domain.

In the second API, the entire flat partitioning is passed in,
giving an array of masks, one mask for each desired sched domain.
The passed in masks do not overlap, but might not cover all CPUs.

Question -- how does one turn off load balancing on some CPUs
using the first API?

Does one do this by forming singleton sched domains of one
CPU each? Is there any downside to doing this?

The simplest cpuset code to work with this would end up exposing
this method of disabling load balancing to user space, forcing
users to create cpusets with one CPU each to be able to disable
load balancing.

However a little bit of additional kernel cpuset code could hide
this detail from user space, by recognizing when the user had
asked to turn off load balancing on some larger cpuset, and by
then calling partition_sched_domains() multiple times, once for
each CPU in that cpuset.

There might be an even simpler way. If the kernel/sched.c routines
detach_destroy_domains() and build_sched_domains() were exposed as
external routines, then the cpuset code could call them directly,
removing the partition_sched_domains() routine from sched.c entirely.
Would this be worth pursuing?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jacks...
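The idea floated above, hiding singleton domains behind the single-mask interface by calling it once per CPU, might look roughly like the following user-space sketch. It is only an illustration: partition_one() is a hypothetical stand-in that merely records the masks it is handed (the real partition_sched_domains() takes a cpumask_t pointer and rebuilds domains), and cpu_mask is an assumed 64-bit substitute for cpumask_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: bit n set means CPU n is a member. */
typedef uint64_t cpu_mask;

#define MAX_CPUS 64

/* Log of the domains requested so far, so the effect can be inspected. */
static cpu_mask recorded[MAX_CPUS];
static size_t nrecorded;

/* Hypothetical single-mask interface: one call forms one sched domain. */
int partition_one(cpu_mask partition)
{
    if (partition == 0 || nrecorded >= MAX_CPUS)
        return -1;
    recorded[nrecorded++] = partition;
    return 0;
}

/*
 * Turn off load balancing over 'cpus' by giving every CPU its own
 * one-CPU domain. Returns the number of calls made, or -1 on failure.
 */
int isolate_each_cpu(cpu_mask cpus)
{
    int calls = 0;

    for (unsigned int cpu = 0; cpu < MAX_CPUS; cpu++) {
        cpu_mask bit = (cpu_mask)1 << cpu;

        if (!(cpus & bit))
            continue;
        if (partition_one(bit) != 0)
            return -1;
        calls++;
    }
    return calls;
}
```

By contrast, the array-based interface would express the same request in a single call, simply by leaving those CPUs out of every mask in the array.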
Yes, and no (it does the same thing as your version internally, so

Yeah: do all that in cpusets. It's already information you would have
to derive in order to make it work properly anyway. If you are not
passing in the singleton domains ATM, then they will not get properly

detach_destroy and build are things that may get reimplemented to
suit different capabilities (eg. we might want to have multiple trees of
domains for each CPU). So I think it is important to expose the simple
partition API which should be unambiguous and stable.

It's not a huge deal, but I'd like to keep partition_sched_domains. After
my patch, it's really simple.
-
ok
It's a deal.
I've got a couple of brown paper bag bug fixes almost ready to send
out, for the patch I sent Andrew a few days ago:

cpuset and sched domains: sched_load_balance flag
I'll send these in, and then get some sleep and code up these changes
to the partition_sched_domains, along the lines you have recommended.

Thanks, Nick and Ingo.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
in any case i'd like to see the externally visible API get in foremost -
and there now seems to be agreement about that. (yay!) Any internal
shaping of APIs can be done flexibly between cpusets and the scheduler.

Ingo
-
Yup - though Nick and I will have to agree to -some- internal interface
between the cpuset and sched code, at least for the moment.

At least, if we thrash about on this, we won't be changing the externally
visible API around. We'll just continue driving Andrew nuts, not our
users - that's an improvement.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
OK look, I don't want to hold up progress. I do like to ask these questions
and be difficult if I think there might be a better way and/or to make sure
you've thought about all the angles. I'm not volunteering to maintain
cpusets and I'm not as close to the customers who care as you.

Obviously what your patches do here is a lot closer to "the right thing"
than cpus_exclusive. And the worst problem they'll cause is to add
cruft to cpusets.c.

So I'll keep pursuing the other subthread for the same reasons, but aside
from implementation nits, I don't know if it is worth holding up a merge.
-
ok. Then let's go back to the original plan: your two patches and the new
flag. Nick?

Ingo
-
Yup - it's not a good name for asking for a partition.
That's because it isn't asking for a partition.
Yup - it's asking for load balancing over that set. That is why it is
called that. There's no idea here of better or worse load balancing,
that's an internal kernel scheduler subtlety -- it's just a request that
load balancing be done.

That is what is visible to user space: whether or not tasks get moved
from overloaded CPUs to underloaded, though still allowed, CPUs.

This is visible to user space in two ways:
1) as task movement, which may or may not be what is desired, and
2) as kernel CPU cycles spent, because load balancing costs CPU cycles
that increase more than linearly with the number of CPUs being
balanced.

The user doesn't give a hoot what a 'sched domain' is. They care to
manage (1) whether their tasks might move under a load imbalance, and

You would do this with the current, single rooted cpuset (and now
cgroup) mechanism by having multiple immediate child cpusets of the
root cpuset, which partition the system CPUs. There is no need to

I don't know what proposal you are reacting to here. Clearly not this
patch that I have proposed, as it is trivially easy to indicate whether
you want to load balance the root cpuset - by setting or clearing the
'sched_load_balance' flag in the root cpuset.

My approach doesn't do that - perhaps we aren't communicating.
We are in complete agreement that the admin should specify what they
We are in complete agreement in insisting on this.
In short:
The kernel scheduler's dynamic sched domains are --not-- the service
being provided to the user. "Sched domains" are just the kernel
internal mechanism.

The service being provided is dynamic load balancing of tasks from
overloaded CPUs to underloaded CPUs.

Some users will want to disable load balancing on some cpusets, because
either:
(1) it's too expensive to balance really large cpusets unless really
...
Yeah yeah OK, you turn it off in the parent cpuset of the child cpusets
which you want the partitioning to occur in, and ensure there are no
other overlapping cpusets with that flag turned on in order to create a

OK, if it prohibits balancing when sched_load_balance is 0, then it is
Yeah, but the interface is not very nice. As an interface for hard
What do you mean by bastardized? What's wrong with having a real
Not your proposal, just the idea to have enough information to be able
to work out a more optimal set of sched-domains automatically. Actually
we can do most of it already automatically, but not hard partitioning.

[snip]
As I said, neither is really semantically more powerful than the other. So
yeah those things are possible to do with your API, but I don't like the API.
-
It doesn't prohibit load balancing just because sched_load_balance is 0.
Only if there are no overlapping cpusets still needing balancing does it

Yeah -- cpusets are hierarchical. And some of the use cases for
Changing cpusets from single root to multiple roots would be
bastardizing it.

My proposed sched_load_balance API is already quite capable of
representing what you see the need for - hard partitioning. It is also
quite capable of representing some other situations, such as I've
described in other replies, that you don't seem to see the need for.

To repeat myself, in some cases, such as batch schedulers running in a
subset of the CPUs on a large system, the code that knows some of the
needs for load balancing does not have system wide control to mandate
hard partitioning. The batch scheduler can state where it is depending
on load balancing being present, and the system administrator can choose
or not to turn off load balancing in the top cpuset, thereby granting or
not control over load balancing on the CPUs controlled by the batch
scheduler to the batch scheduler.

Hard partitioning is not the only use case here.
If you don't appreciate the other cases, then fine ... but I don't think
that gives you grounds to reject a patch just because it is not precisely

What's wrong with it is that 1) it doesn't cover all the use cases,
2) it would require a new and different mechanism other than cpusets
which are not multiple rooted, and do robustly support overlapping
sets and hence are not a hard partitioning, and 3) we'd still need
the cpuset based API to cover the remaining use cases.

Good grief -- I must be misunderstanding you here, Nick. I can't
imagine that you want to turn cpusets into a multiple rooted hard

I can't figure out what the sentence is saying ... sorry.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Yeah, that's what I mean. Important point: prohibits, rather than
Well OK, if that's your definition. Not very helpful though.
Can I win this argument by defining sched_load_balance
What happens when you partition the system with your approach, and
you get kernel threads being spawned into the root cpuset and getting
unbalanced?

I just still don't understand how those cases work... if you can spell it

The above sentence was explaining which proposal I was talking
The above sentence was explaining which proposal I was talking
about.
-
Yup. We've got a square peg and a round hole. An impedance mismatch.
That's the root cause of this entire wibbling session, in my view.

The essential role of cpusets, cgroups and much other such work of
recent, in my view, is pounding this square peg into that round hole.
In essence, it is fitting the hierarchical structure of the
organizations (corporations, universities and governments) who own big
systems to the flat, system-wide mandates needed to manage a given

Well, such a change would be rather substantial and undesired,

If I understand your approach to the kernel-to-user interface correctly
(sometimes I doubt I do) then your approach expected some user space code
or person or semi-intelligent equivalent to define a flat partition,
which will then be used to determine the sched domains.

In the batch scheduler case, running on a large shared system used
perhaps by several departments, no one entity can do that. One person,
perhaps the system admin, knows if they want to give complete control
of some big chunk of CPUs to a batch scheduler. The batch scheduler,
written by someone else far away and long ago, knows which jobs are
actively running on which subsets of the CPUs the batch scheduler is
using.

There is no single monolithic entity on such systems who knows all and
can dictate all details of a single, flat, system-wide partitioning.

The partitioning has to be synthesized from the combined requests of
several user space entities. That's ok -- this is bread and butter
work for cpusets.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
OK, so I don't exactly understand you either. To make it simple, can
you give a concrete example of a cpuset hierarchy that wouldn't

OK, so to really do anything different (from a non-partitioned setup),
you would need to set sched_load_balance=0 for the root cpuset?
Suppose you do that to hard partition the machine, what happens to
newly created tasks like kernel threads or things that aren't in a
cpuset?
-
It's more a matter of knowing how my third party batch scheduler
coders think. They will be off in some corner of their code with a
cpuset in hand that they know is just being used to hold inactive
(paused) tasks, and they can likely be persuaded to mark those cpusets
as not being in need of any wasted CPU cycles load balancing them.

But these inactive cpusets will overlap in unknown (to them at
the time, in that piece of code) ways with other cpusets holding
active jobs, and there is no chance, unless it is a matter of major
performance impact, that they will be in any position to comment on
the proper partitioning of the sched domains on all the CPUs under the
control of their batch scheduler, much less comment on the partitioning
of the rest of the system.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
There won't be any CPU cycles used, if the tasks are paused (surely
-
Consider the case when there are two, smaller, non-overlapping cpusets
with active jobs, and one larger cpuset, covering both those smaller
ones, with only paused tasks.

If we realize we don't need to balance the larger cpuset, then we can
have two smaller sched domains rather than one larger one.

Since the CPU cycle cost of load balancing increases more than linearly
with the size of the sched domain, therefore it will save CPU cycles to
have the two smaller ones, rather than the one larger one.

If user space can just tell us that the larger cpuset doesn't need
balancing, then the kernel has enough information to perform this
optimization.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
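A small worked example may make the cost argument above concrete. The thread only states that the cost grows "more than linearly" with domain size; the quadratic model below is purely an assumption chosen for illustration, and balance_cost() and partition_cost() are hypothetical names, not kernel functions.

```c
#include <assert.h>

/*
 * ASSUMPTION for illustration only: balancing a domain of n CPUs costs
 * n * n abstract units. The real growth rate is not given in the thread.
 */
unsigned long balance_cost(unsigned long ncpus)
{
    assert(ncpus <= 65535UL);   /* keeps n * n within a 32-bit unsigned long */
    return ncpus * ncpus;
}

/* Total cost of balancing a set of disjoint domains of the given sizes. */
unsigned long partition_cost(const unsigned long *sizes, unsigned long ndoms)
{
    unsigned long total = 0;

    for (unsigned long i = 0; i < ndoms; i++)
        total += balance_cost(sizes[i]);
    return total;
}
```

Under this model one 64-CPU domain costs 64 * 64 = 4096 units, while two 32-CPU domains cost 2 * 32 * 32 = 2048 units, so dropping the larger, idle cpuset from balancing halves the work. Any superlinear cost function gives a saving in the same direction; only the size of the saving depends on the model.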
-
Yup - exactly. In fact one code fragment in my patch highlights this:
/* Special case for the 99% of systems with one, full, sched domain */
if (is_sched_load_balance(&top_cpuset)) {
ndoms = 1;
doms = kmalloc(sizeof(cpumask_t), GFP_KERNEL);
*doms = top_cpuset.cpus_allowed;
goto rebuild;
}

This code says: if the top cpuset is load balanced, you've got one
big fat sched domain covering all (nonisolated) CPUs - end of story.
None of the other 'sched_load_balance' flags matter in this case.

Logically, the above code fragment is not needed. Without it, the
code would still do the same thing, just wasting more CPU cycles doing

Well ... --every-- task is in a cpuset, always. Newly created tasks
start in the cpuset of their parent. Grep for 'the_top_cpuset_hack'
in kernel/cpuset.c to see the lengths to which we go to ensure that
current->cpuset always resolves somewhere.

The usual case on the big systems that I care about the most is
that we move (almost) every task out of the top cpuset, into smaller
cpusets, because we don't want some random thread intruding on the
CPUs dedicated to a particular job. The only threads left in the root
cpuset are pinned kernel threads, such as for thread migration, per-cpu
irq handlers and various per-cpu and per-node disk and file flushers
and such. These threads aren't going anywhere, regardless. But no
thread that is willing to run anywhere is left free to run anywhere.

I will advise my third party batch scheduler developers to turn off
sched_load_balance on their main cpuset, and on any big "holding tank"
cpusets they have which hold only inactive jobs. This way, on big
systems that are managed to optimize for this, the kernel scheduler
won't waste time load balancing the batch scheduler's big cpusets that
don't need it. With the 'sched_load_balance' flag defined the way
it is, the batch scheduler won't have to make system-wide decisions
...
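From user space, turning the flag off as advised above amounts to writing "0" to the cpuset's sched_load_balance file (and "1" to turn it back on). A minimal sketch follows. The helper name set_load_balance() is hypothetical, and the caller has to supply the path: where the cpuset filesystem is mounted and how the cpusets are named are site-specific and not stated in this thread.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "1" (enable) or "0" (disable) to a cpuset flag file.
 * flag_path is supplied by the caller; no mount point is assumed here.
 * Returns 0 on success, -1 if the file could not be opened or written.
 */
int set_load_balance(const char *flag_path, int enable)
{
    FILE *f;
    int ok;

    assert(flag_path != NULL);
    f = fopen(flag_path, "w");
    if (!f)
        return -1;
    ok = fputs(enable ? "1" : "0", f) != EOF;
    if (fclose(f) != 0)     /* a failed flush shows up here */
        ok = 0;
    return ok ? 0 : -1;
}
```

A batch scheduler would call this on its own main cpuset and on its holding-tank cpusets only; it never needs to know about, or touch, cpusets outside its control.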
These are what I'm worried about, and things like kswapd, pdflush,
could definitely use a huge amount of CPU.

If you are interested in hard partitioning the system, you most
definitely want these things to be balanced across the non-isolated
CPUs.
-
But these guys are pinned anyway (or else they would already be moved
into a smaller load balanced cpuset), so why waste time load balancing
what can't move?

And on some of the systems I care about, we don't want to load balance
these guys; rather we go to great lengths to see that they don't run at
all when we don't want them to.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
They're not pinned (kswapds are pinned to a node, but still). pdflush
is not pinned at all and can be dynamically created and destroyed. Ditto
for kjournald, as well as many others.

Basically: it doesn't feel like a satisfactory solution to brush these under
Most smaller realtime partitioned systems will want to, I'd expect.
-
Whatever is not pinned is moved out of the top cpuset, on the kind of
systems I'm most familiar with. They are put in a smaller cpuset, with
load balancing, that is sized for the workload they might present, but

We don't do a whole lot of brushing under the carpet on these kind of
systems. If I gave you the impression we do, then I misled you - sorry.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
So if a new pdflush is spawned, it gets moved to some cpuset? That
probably isn't something these realtime systems want to do (ie. the
non-realtime portion probably doesn't want to have any sort of scheduler

No, not on your systems. I'm worried about the smaller ones that don't
get so much attention (eg. hard partitioning for realtime).
-
No - the new pdflush is put in the same cpuset as its parent, with a
patch that I sent in early this year. See the following code in
mm/pdflush.c:

/*
 * Some configs put our parent kthread in a limited cpuset,
 * which kthread() overrides, forcing cpus_allowed == CPU_MASK_ALL.
 * Our needs are more modest - cut back to our cpusets cpus_allowed.
 * This is needed as pdflush's are dynamically created and destroyed.
 * The boottime pdflush's are easily placed w/o these 2 lines.
 */
cpus_allowed = cpuset_cpus_allowed(current);
set_cpus_allowed(current, cpus_allowed);

return __pdflush(&my_work);
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
From: Paul Jackson <pj@sgi.com>
The kernel/cpuset.c code handling the updating of a cpusets
'cpus' and 'mems' masks was starting to look a little bit
crufty to me.

So I rewrote it a little bit. Other than subtle improvements
in the consistency of identifying white space at the beginning
and end of passed in masks, I don't see that it makes any
visible difference in behaviour. But it's one or two hundred
kernel text bytes smaller, and to my eye, easier to understand.

Signed-off-by: Paul Jackson <pj@sgi.com>
---
Andrew - this patch goes after:
cpuset-and-sched-domains-sched_load_balance-flag

kernel/cpuset.c | 50 ++++++++++++++++++++------------------------------
1 file changed, 20 insertions(+), 30 deletions(-)

--- 2.6.23-rc8-mm1.orig/kernel/cpuset.c 2007-09-30 01:27:28.442825126 -0700
+++ 2.6.23-rc8-mm1/kernel/cpuset.c 2007-09-30 01:38:22.829256421 -0700
@@ -488,6 +488,14 @@ static int validate_change(const struct
return -EINVAL;
}

+ /* Cpusets with tasks can't have empty cpus_allowed or mems_allowed */
+ if (cgroup_task_count(cur->css.cgroup)) {
+ if (cpus_empty(trial->cpus_allowed) ||
+ nodes_empty(trial->mems_allowed)) {
+ return -ENOSPC;
+ }
+ }
+
return 0;
}

@@ -691,11 +699,13 @@ static int update_cpumask(struct cpuset
trialcs = *cs;

/*
- * We allow a cpuset's cpus_allowed to be empty; if it has attached
- * tasks, we'll catch it later when we validate the change and return
- * -ENOSPC.
+ * An empty cpus_allowed is ok iff there are no tasks in the cpuset.
+ * Since cpulist_parse() fails on an empty mask, we special case
+ * that parsing. The validate_change() call ensures that cpusets
+ * with tasks have cpus.
*/
- if (!buf[0] || (buf[0] == '\n' && !buf[1])) {
+ buf = strstrip(buf);
+ if (!*buf) {
cpus_clear(trialcs.cpus_allowed);
} else {
retval = cpulist_parse(buf, trialcs.cpus_allowed);
@@ -703,10 +713,6 @@ static int update_cpumask(struct cpuset
re...
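The parsing change in the hunk above (strip surrounding white space first, then treat an empty string as "clear the mask") can be approximated in user space as follows. This is a sketch only: strip() is a hand-written approximation of the kernel's strstrip(), and mask_text_is_empty() is a hypothetical helper standing in for the `if (!*buf)` test; neither is the patch's code.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * User-space approximation of strstrip(): skips leading white space,
 * cuts trailing white space in place, returns the first kept character.
 */
char *strip(char *s)
{
    size_t len;

    assert(s != NULL);
    while (isspace((unsigned char)*s))
        s++;
    len = strlen(s);
    while (len > 0 && isspace((unsigned char)s[len - 1]))
        s[--len] = '\0';
    return s;
}

/* Returns 1 if the text written by the user denotes an empty mask. */
int mask_text_is_empty(char *buf)
{
    return *strip(buf) == '\0';
}
```

Compared with the removed test, which recognized only an empty buffer or a lone newline, this treats any all-white-space input as an empty mask, which matches the "consistency of identifying white space" mentioned in the patch description.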
