(Aside to the RealTime folks -- is there a 'realtime'
email list which I should include in this discussion?)

The kernel has an "isolcpus=" kernel boot time parameter. This
parameter isolates CPUs from scheduler load balancing, minimizing the
impact of scheduler latencies on realtime tasks running on those CPUs.

Questions:
==========

Do you, or someone you know, use "isolcpus="?
Can we remove it?
Should we remove it?
Should we first deprecate it somehow, for a while, before
removing it?

Background:
===========

In July 2004, Dimitri Sivanich <sivanich@sgi.com> proposed
"isolcpus=" for realtime isolation of CPUs from the scheduler
(http://lkml.org/lkml/2004/7/22/97).

Ingo said of it "looks good", and Nick said "Cool."
It appeared in 2.6.9 Linux kernels.
It made Item #6 of Zack Brown's Kernel Traffic #272, dated Sept
5, 2004.

It also made LWN.net Weekly Edition for October 28, 2004, at
http://lwn.net/Articles/107490/bigpage.

Dimitri's fifteen minutes of fame had begun ;).

In April 2005, Dinakar Guniguntala <dino@in.ibm.com> proposed
dynamic scheduler domains (http://lkml.org/lkml/2005/4/18/187).

It was immediately recognized by Nick that this new work was a
"complete superset of the isolcpus= functionality."

Dinakar concurred, responding that he "was hoping that by the
time we are done with this, we would be able to completely get
rid of the isolcpus= option."

To which I (pj) replied "I won't miss it. Though, since it's
in the main line kernel, do you need to mark it deprecated for
a while first?"

Since then, dynamic scheduler domains and cpusets have seen much
work. See for example http://lkml.org/lkml/2007/9/30/29, which
added the sched_load_balance flag to cpusets.

However nothing much has changed with regard to the "isolcpus=" kernel
boot time parameter. This parameter is still there. In October of
2006, Derek Fults <dfults@sgi.c...
Hi Paul,

in short: NAK!

I used it to mask out a defective CPU on an 8-CPU node of an
HPC cluster at a customer site, until the $BIG_VENDOR
sent a replacement. And to prove to $BIG_VENDOR that we actually
have a problem on THAT CPU.

So I would really like to keep this fault isolation capability.
I made my customer happy with that.

I wish Linux had more such "mask out bad hardware" features
to facilitate fault isolation at boot and runtime.

Best Regards
Ingo Oeser
--
Yeah - except that it's not meant to be used as such - it will still
bring the cpu up, and it is still usable for the OS.

So sorry, your abuse doesn't make for a case to keep this abomination.
--
How come it is an abomination? It is an easy way to do what it does,
and it's actually not a bad thing for some uses not to have to use
cpusets.

Given that it's all __init code anyway, is there a real reason _to_
remove it?
--
IMHO,
What is an abomination is that cpusets are required for this type of
isolation to begin with, even on a 2 processor machine.

I would like the option to stay and be extended like Max originally
proposed. If cpusets/hotplug are configured, isolation would be obtained
using them. If not, then isolcpus could be used to get the same isolation.

From a userland point of view, I just want an easy way to fully
isolate a particular cpu. Even a new syscall or extension to
sched_setaffinity would make me happy. Cpusets and hotplug don't.

Again this is just MHO.
Regards
Mark
--
Mark, I used to be the same way and I'm a convert now. It does seem like
overkill for a 2-cpu machine to have cpusets and cpu hotplug. But both options
cost around 50KB worth of text and maybe another 10KB of data. That's on the
x86-64 box. Let's say it's 100KB. Not a terribly huge overhead.

Now if you think about it, in order to be able to dynamically isolate a cpu we
have to do the exact same thing that CPU hotplug does, which is to clear all
timers, kernel threads, etc from that CPU. It does not make sense to
implement separate logic for that. You could argue that you do not need
dynamic isolation, but it's too inflexible in general: even on 2-way machines
it's a waste to not be able to use the second cpu for general load when the RT app
is not running. Given that CPU hotplug is necessary for many things, including
suspend on multi-cpu machines, it's practically guaranteed to be very stable
and well supported. In other words we have a perfect synergy here :).

Now, about the cpusets. You do not really have to do anything fancy with them.
If all you want to do is to disable systemwide load balancing:
   mount -t cgroup -o cpuset cpuset /dev/cpuset
   echo 0 > /dev/cpuset/cpuset.sched_load_balance

That's it. You get _exactly_ the same effect as with isolcpus=. And you can
change that dynamically, and when you switch to quad- and eight-core machines
then you'll be able to do that with groups of cpus, not just system wide.

Just to complete the example above. Let's say you want to isolate cpu2
(assuming that cpusets are already mounted).

   # Bring cpu2 offline
   echo 0 > /sys/devices/system/cpu/cpu2/online

   # Disable system wide load balancing
   echo 0 > /dev/cpuset/cpuset.sched_load_balance

   # Bring cpu2 online
   echo 1 > /sys/devices/system/cpu/cpu2/online

Now if you want to un-isolate cpu2 you do
   # Enable system wide load balancing
   echo 1 > /dev/cpuset/cpuset.sched_load_balance

Of course this is not a complete isolation. There are also irqs (see my
"default irq...
Furthermore, cpusets allow for isolated but load-balanced RT domains. We
now have a reasonably strong RT balancer, and I'm looking at
implementing a full partitioned EDF scheduler somewhere in the future.

This could never be done using isolcpus.
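
The hotplug plus cpuset recipe from Max's message earlier in the thread can be wrapped in a few shell functions. This is only a sketch under stated assumptions: the function names and the CPU_SYSFS / CPUSET_ROOT overrides are made up for illustration, the paths are the ones quoted in the thread, and it needs root on a kernel with CPU hotplug and cpusets enabled. As the thread notes, it does not deal with IRQs.

```shell
#!/bin/sh
# Sketch only: paths follow the thread; the override variables are hypothetical
# and exist so the functions can be tried against a scratch directory first.
CPU_SYSFS=${CPU_SYSFS:-/sys/devices/system/cpu}
CPUSET_ROOT=${CPUSET_ROOT:-/dev/cpuset}

# set_load_balance 0|1: turn load balancing in the top-level cpuset off or on.
set_load_balance() {
    echo "$1" > "$CPUSET_ROOT/cpuset.sched_load_balance"
}

# isolate_cpu N: take cpuN offline so hotplug migrates timers, threads and
# queued work away, disable system-wide balancing, then bring cpuN back.
isolate_cpu() {
    echo 0 > "$CPU_SYSFS/cpu$1/online" || return 1
    set_load_balance 0
    status=$?
    # Always try to bring the CPU back, even if the cpuset write failed.
    echo 1 > "$CPU_SYSFS/cpu$1/online" || return 1
    return $status
}
```

Calling `isolate_cpu 2` mirrors the cpu2 example from the thread, and `set_load_balance 1` undoes the balancing change.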
--
Thanks for the detailed tutorial Max. I'm personally still very
skeptical. I really don't believe you'll ever be able to run multiple
_demanding_ RT environments on the same machine. No matter how many
processors you've got. But even though I might be wrong there, that's
actually OK with me. I, and I'm sure most, don't have a problem with
dedicating a machine to a single RT env.

You've got to hold your tongue just right, look at the right spot on the
wall, and be running the RT patched kernel, all at the same time, to run
just one successfully. I just want to stop using my tongue and staring
at the wall. I personally feel that a single easy method of completely
isolating a single processor from the rest of the machine _might_
benefit the RT community more than all this fancy stuff coming down the
pipe. Something like your original proposed isolcpus or even a simple

I'm sure my thoughts reflect a gross underestimate of what really has
to happen. I will hope for the best and wait.

Regards
Mark
--
I understand your scepticism but it's quite easy to do these days. Yes there
are certain restrictions on how RT applications have to be designed, but
it's definitely not rocket science. It can be summed up in a few words:
   "cpu isolation, lock-free communication and memory management,
    and direct HW access"
In other words you want to talk using lock-free queues and mempools between
soft- and hard-RT components and use something like libe1000.sf.net to talk
to the outside world.
There are other approaches of course, those involve RT kernels, Xenomai, etc.

As I mentioned awhile ago we (here at Qualcomm) actually implemented full
blown UMB (one of the 4G broadband technologies) basestation that runs entire
MAC and part of PHY layers in the user-space using CPU isolation techniques.
Vanilla 2.6.17 to .24 kernel + cpuisol and off-the-shelf dual-Opteron and
Core2Duo based machines. We have very very tight deadlines and yet everything
works just fine. And no, we don't have to do any special tongue holding or other
rituals :) for it to work. In fact quite the opposite. I can do full SW
(kernel, etc) builds and do just about anything else while our basestation
application is running. Worst case latency in the RT thread running on the
isolated CPU is about 1.5usec.

Now I switched to 8way Core2Quad machines. I can run 7 RT engines on 7
isolated CPUs and load cpu0. Latencies are a bit higher, 5-6 usec (I'm guessing
due to shared caches and stuff) but otherwise it works fine. This is with the
2.6.25.4-cpuisol2 and syspart (syspart is a set of scripts for setting up
system partitions). I'll release both either later today or early next week.

Yes it may seem that way. But as I explained in the previous email, in order
to actually implement something like that we'd need to reimplement parts of
the cpusets and cpu hotplug. I'm not sure if you noticed or not but my
original patch actually relied on the cpu hotplug anyway. Just because it
makes no sense not to awesome p...
[ sorry if this is going OT ]
I'm working on a partitioned EDF scheduler right now, and I have to
face several issues, starting from the interface to use to expose the
EDF scheduler to userspace, and the integration with the existing
sched_rt policy.

For now I'm experimenting with an additional sched_class that implements
a SCHED_EDF policy, extending the POSIX struct sched_param with the
EDF parameters of the task. Do you see any better way to do that?
Could that approach be reasonable?

Michael
--
I would add a sched_class above sched_rt and let sched_rt run in all
time left unclaimed by sched_edf.

Have you looked at deadline inheritance to replace PI? I think it can be
Yes, that is the way I'm leaning.
--
I add this type of class before sched_rt, so the next of sched_edf
I think it can be done with an rb tree. The only tricky
Right now I'm facing some problems. It's still not clear to me what parameters a task forked from a sched_edf task should get, as it would involve some form of admission control, and how to deal with tasks that run longer than their nominal execution time (i.e., should we use some server mechanism to limit the amount of cpu they're using, or handle that in some other way?)

Michael
--
Mapping them onto U64_MAX - prio or something like that ought to do.
Handling wraparound of the timeline might get a little involved though -
I'd start with something like:
   u64 sched_param::edf_period [ns]
   u64 sched_param::edf_runtime [ns]

so that deadline = time_of_schedule + edf_period, and its allowance
within that period is edf_runtime.

fork would inherit the parent's settings, and we'd need to do admission
control on all tasks entering SCHED_EDF, either through setscheduler()

Yeah - we already account the amount of runtime, we can send them
SIGXCPU and stop running them. Look at the rt_bandwidth code upstream -
it basically stops rt task groups from running once their quota is
depleted - waking them up once it gets refreshed due to the period
expiring.

For single tasks it's easier, just account their time and dequeue them
once it exceeds the quota, and enqueue them on a refresh timer thingy to
start them again once the period rolls over.

The only tricky bit here is PI :-) it would need to keep running despite
being over quota.
--
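
The parameters sketched above (deadline = time of schedule + period, with a runtime budget inside each period) can be illustrated with plain integer arithmetic. This is a toy sketch, not scheduler code: all function names are hypothetical, and the admission test uses the classic single-CPU utilization bound for EDF, sum(runtime/period) <= 1, which is an assumption brought in for illustration. The thread does not say which admission test SCHED_EDF would use.

```shell
#!/bin/sh
# Toy arithmetic only; all values are in nanoseconds and names are hypothetical.

# edf_deadline NOW PERIOD: absolute deadline of a task scheduled at NOW.
edf_deadline() {
    echo $(( $1 + $2 ))
}

# edf_over_quota USED RUNTIME: succeeds once the runtime consumed in the
# current period exceeds the budget, i.e. the task should be dequeued
# until the period rolls over.
edf_over_quota() {
    [ "$1" -gt "$2" ]
}

# edf_admit RUNTIME:PERIOD ...: succeeds if total utilization is at most 1.
# Utilization is summed in parts per million, rounding each task up so the
# check errs on the conservative side.
edf_admit() {
    total_ppm=0
    for task in "$@"; do
        runtime=${task%%:*}
        period=${task##*:}
        [ "$period" -gt 0 ] || return 2
        total_ppm=$(( total_ppm + (runtime * 1000000 + period - 1) / period ))
    done
    [ "$total_ppm" -le 1000000 ]
}
```

For example, `edf_admit 2000000:10000000 3000000:10000000` accepts two tasks that together use half of one CPU, while a set adding up to more than one CPU is rejected.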
This is what I'm doing right now (apart from using timespec structs
instead of u64 values to align the sched_param struct specified by
POSIX on systems with SCHED_SPORADIC support).

Ok, using the same mechanism even for SCHED_EDF tasks seems the
There is some work in this area, and there are some protocols
handling that, but that simple solution will be a good starting
point.
--
Just to be sure I'm following you here, you're stating that you
want to be able to manipulate the isolated cpu map at runtime,
not just with the boot option isolcpus, right? Where this
isolated cpu map works just fine even on systems which do
not have cpusets configured, right?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Yes to both questions. However after reading Max's and Peter's responses, I
guess there is another, probably better or _only_, way to get what I really
need anyway, so please don't consider my intrusion into this thread as a NAK.

I do not rely on this option as it is implemented.
Regards
Mark
--
Thank-you for your clearly stated conclusions, and thanks for stopping by.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Ingo, I just wanted to elaborate on what Peter is saying. That CPU will still
have to be _booted_ properly. It may be used for hard- and soft-interrupt
processing, workqueues (internal kernel queuing mechanism) and kernel timers.

In your particular case you're much much much better off with doing
   echo 0 > /sys/devices/system/cpu/cpuN/online
either during the initrd stage or as a first init script.
That way the bad cpu will be _completely_ disabled.

Max
--
Hi Max,
Hi Peter,

The initrd is from the distribution. I have no sane way to change it
quickly and permanently. Can I change the initrd and still have a certified
RHEL or SLES? Are there initrd hooks which survive package installation?

I would really appreciate some way to keep the kernel from using
a CPU at all to do fault isolation. If possible not even booting it.

Boot parameters survived all distro fiddling so far. I love them!
Try to convince a hardware vendor that you don't have a software bug.
Try to convince him that you didn't break the hardware by swapping it around.

So I'll ACK removing isolcpus, if we get a better replacement boot option.
Best Regards
Ingo Oeser
--
Not sure what you meant here. Stuff that I listed has nothing to do with user
That's why I mentioned "first init" script. You can create a simple init.d
compliant script that runs with priority 0 (see /etc/init.d/network for
How does the isolcpus= boot option help in this case?
I suppose the closest option is maxcpus=. We can probably add ignorecpus= or
I think you're missing the point here. It's like saying
"Let's not switch to electric cars because I use gasoline to kill weeds".

As I mentioned before, cpus listed in the isolcpus= boot option will still
handle hard-/soft-irqs, kernel work, kernel timers. You are much better off
using cpu hotplug (ie putting the bad cpu offline). Feel free to propose an
ignorecpus= option in a separate thread.

Max
--
btw Ingo, I just realized that the maxcpus= option is exactly what you need.
Here is how you can use it.
Boot your system with maxcpus=1. That way the kernel will only bring up
processor 0. I'm assuming cpu0 is "good" otherwise your system is totally
busted :). Other cpus will stay off-line and will not be initialized.
Then once the system boots you can selectively bring "good" processors online
by doing
   echo 1 > /sys/devices/system/cpu/cpuN/online

This actually solves the case you're talking about (ie ignoring bad
processors) instead of partially covering it with isolcpus=.

Dimitri, you can probably use that too. ie Boot the thing with most CPUs
offline and then bring them online. That way you'll know for sure that no
timers, works, hard-/soft-irqs, etc are running on them.

So I expect two ACKs for isolcpus= removal from both of you, in bold please :)
Max
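
The maxcpus=1 approach above could be scripted roughly as follows. This is a sketch under assumptions: the function name and the CPU_SYSFS override are invented for illustration, the sysfs path is the one quoted in the message, and it only works as root on a kernel built with CPU hotplug support, where the per-CPU "online" files actually exist.

```shell
#!/bin/sh
# Sketch: after booting with maxcpus=1, bring every CPU online except the
# ones named as bad. CPU_SYSFS is a hypothetical override for dry runs.
CPU_SYSFS=${CPU_SYSFS:-/sys/devices/system/cpu}

# online_good_cpus BAD_CPU...: arguments are the CPU numbers to leave offline.
online_good_cpus() {
    bad_list=" $* "
    for dir in "$CPU_SYSFS"/cpu[0-9]*; do
        # Skip CPUs without an 'online' control file (often cpu0), and the
        # unexpanded glob if no cpuN directories exist at all.
        [ -f "$dir/online" ] || continue
        n=${dir##*/cpu}
        case "$bad_list" in
            *" $n "*) continue ;;   # known-bad CPU: leave it offline
        esac
        echo 1 > "$dir/online"
    done
}
```

For the 8-CPU node in Ingo's example, a first init script could call `online_good_cpus 5` if cpu5 were the defective one (the CPU number here is made up).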
--
Hi Max,
I just tested it on the Ubuntu Hardy standard kernel on an AMD DualCore.
The /sys/devices/system/cpu/cpu1 entry doesn't show up.

I can send you the dmesg/config in private, if you want.
Did you test your suggestion?
After our discussion, I tried to find the right spot to implement
"disablecpus=" myself, but couldn't find the right position to hook it up.

If your idea would work out, I would ACK it. Even in bold :-)
Best Regards
Ingo Oeser
--
No, I just give random suggestions without verifying them first ;-).
Of course I tried it. Sounds like your Ubuntu kernel does not have "CPU
hotplug / suspend on SMP" enabled. I thought that all distributions enable it
by default these days.

Max
--
When you bring a CPU online, in theory the sched domains should get
set up for them, so you should start seeing processes get migrated
onto it, and with them timers, work queues etc.

If you have irqbalanced running, it probably migrates irqs onto them
as well if it needs to.
--
Sure. My suggestion assumes that system wide balancing is disabled (via top
level cpuset) and that IRQ affinity masks are properly setup. I mentioned that
in my earlier emails.

Max
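
Setting up the IRQ affinity masks mentioned above might look like the following. It is a sketch under assumptions: the function names and the IRQ_PROC override are hypothetical, it relies on the standard /proc/irq/N/smp_affinity files, and the single-word hex mask only covers machines with fewer than 32 CPUs.

```shell
#!/bin/sh
# Sketch: keep IRQs on the CPUs that are NOT isolated.
# IRQ_PROC is a hypothetical override so this can be tried on a copy.
IRQ_PROC=${IRQ_PROC:-/proc/irq}

# cpus_to_mask CPU...: print the hex bitmask with one bit set per CPU number.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# steer_irqs CPU...: point every IRQ at the given (non-isolated) CPUs.
steer_irqs() {
    irq_mask=$(cpus_to_mask "$@")
    for f in "$IRQ_PROC"/[0-9]*/smp_affinity; do
        [ -f "$f" ] || continue
        # Some IRQs cannot be moved and the write is rejected; carry on.
        echo "$irq_mask" > "$f" 2>/dev/null || :
    done
}
```

With cpu2 isolated on a four-CPU box, `steer_irqs 0 1 3` would write the mask `b` to each IRQ, leaving cpu2 out.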
--
Not from me, anyway.
I've seen enough replies (thanks!) from users of isolcpus=
to be quite certain that we should not just remove it outright.

I will NAQ such proposals.
And until, and unless, someone comes up with a persuasive answer
I'll probably NAQ proposals to deprecate it as well.
Max ... I think one place where you and I disagree is on whether
it is a good idea to have multiple ways to accomplish the same
thing.

Even if you do find a way that seems sane to you, that's not the point,
in my view. Further, given the constraints on producing product that
will fit in with multiple distributions, I doubt that the alternatives
you suggest to Ingo Oeser would work well for him anyway.

A key reason that Linux has succeeded is that it actively seeks to work
for a variety of people, purposes and products. One operating system is
now a strong player in the embedded market, the real time market, and
the High Performance Computing market, as well as being an important
player in a variety of other markets. That's a rather stunning success.

If you went to your local grocery store with your (if you have one)
young child, and found that they had no Lucky Charms breakfast cereal
(your child's favorite) you would not be pleased if the store manager
tried to sell you Fruit Loops instead ... just as much sugar and food
coloring.

If we have features that seem to duplicate functionality, in a
different way, and that aren't causing us substantial grief to
maintain, and that aren't significantly hurting our performance or
robustness or security or seriously getting in the way of further
development, then we usually leave those features in.

Please understand, Max, that for every kernel hacker working in this
corner of the Linux kernel, there are a hundred or a thousand users
depending on what we do, and who will have to adapt to any incompatible
changes we make. If we save ourselves an hour by removing "unnecessary"
features, we can cost a hundred others each some time ada...
We've seen exactly two replies with usage examples. Dimitri's case is legit
but can be handled much better (as in it not only avoids timers, but any other
kernel work) with cpu hotplug and cpusets. Ingo's case is bogus because it
does not actually do what he needs. There is a much better way to do exactly
what he needs which involves only cpu hotplug and has nothing to do with the
Not really. I thought that the two ways that we have are conflicting.
I just looked at the partition_sched_domains() code again and realized that
there is no conflict (cpusets settings override isolcpus=). I was wrong on that.
So I guess there are no reasons to nuke other than

Ingo's case is a bad example. If you reread his use case more carefully you'll
see that he was not actually getting what he expected out of the boot param in
question.

btw Impressive write up. I do like to think of myself as a Ferrari designer,
actually these days I'm more into http://www.teslamotors.com/ rather than
Ferrari :).
So I agree in general of course. As I mentioned my reasoning was 1) I thought
it conflicts with cpusets and 2) it's considered a hack by the scheduler folks
and is not supported (ie my attempts to extend it were rejected). Given that
there is a better mechanism available it seemed to make sense to nuke it.
Peter Z and Ingo M were of a similar opinion, so it seemed.

Anyway, I do not mind us keeping the isolcpus= boot option even though the use cases
mentioned so far are not very convincing.

Max
--
One example I've seen in the past is that someone wanted to isolate a node
completely from any memory traffic to avoid performance disturbance
for memory intensive workloads.

Right now the system boot could put pages from some daemon in there before any
cpusets are set up and there's no easy way to get them away again
(short of migratepages for all running pids, but that's pretty ugly and won't
cover kernel level allocations and also can mess up locality).

Granted, the use case really wants an "isolnodes", but given that there
tends to be enough free memory at boot, "isolcpus" tended to work.

-Andi
--
We (SGI) routinely handle that need with a custom init program,
invoked with the init= parameter to the booting kernel, which
sets up cpusets and then invokes the normal (real) init program
in a cpuset configured to exclude those CPUs and nodes which we
want to remain unloaded. For example, on a 256 CPU, 64 node
system, we might have init running on a single node of 4 CPUs,
and leave the remaining 63 nodes and 252 CPUs isolated from all
the usual user level daemons started by init.

There is no need for additional kernel changes to accomplish this.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
There are no additional changes needed, but you must admit that isolcpus
is a much more elegant solution for this problem than hijacking init.

-Andi
--
While I cannot claim that hijacking init is elegant, our gentle
readers are at risk of losing the context here.

I was responding to a need you noticed to isolate memory nodes (such as
from stray glibc pages placed by init or the shell running early
scripts), not to the need to isolate CPUs:

Granted, this might be a distinction without a difference, because on
the very lightly loaded system seen at boot, local node memory placement
will pretty much guarantee that the memory is placed on the nodes next
to the CPUs on which init or its inelegant replacements are run.

So perhaps it boils down to a question of which is easiest to do,
the answer to which will vary depending on where you are in the food
chain of distributions. Here "easy" means least likely to break
something else. All these mechanisms are relatively trivial, until
one has to deal with conflicting software packages, configurations and
distributions, changing out from under oneself.

That is, it can be desirable to have multiple mechanisms, so that the
various folks independently needing to manipulate such placement can
minimize stepping on each other's feet. By using the rarely hacked init=
mechanism for SGI software addons, we don't interfere with those who
are using the more common isolcpus= mechanism for such purposes as
offlining a bad CPU.

In sum, I suspect we agree that we have enough mechanisms, and don't
need an isolnodes as well.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Yes, but in practice (enough memory for bootup) isolating CPUs
is equivalent to isolating nodes. So isolcpus=... tended to work.

I occasionally recommended it to people because it was much easier
to explain than replacing init.

The perfect solution would probably be to just fix it in init(8)
and make it parse some command line option that then sets up
the right cpusets.

But you asked for isolcpus=... use cases and I just wanted to describe
One solution would be to move isolcpus=/isolnodes= into init(8) and make
sure it's always statically linked. But failing that keeping it in the
kernel is not too bad. It's certainly not a lot of code.

On the other hand if the kernel implemented an isolnodes=... it would
be possible to exclude those nodes from the interleaving the kernel
does at boot, which might also be beneficial and give slightly
more isolation.

-Andi
--
Definitely ... I'd do the same in such cases.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
btw I'm putting together a set of scripts that I call "syspart" (for "system
partitioning") which creates cpusets, sets up IRQ affinity, etc at boot. We
can make your init replacement a part of that package. You could then tell
people to
   1. Install syspart rpm
   2. Change boot opts to "init=/sbin/syspart par0_cpus=0-3 par0_mems=0-2 par0_init"

The thing will then create /dev/cpuset/par0 with cpus0-3 and mems0-2 and put
/sbin/init into that cpuset.

With the exception of #1 it's as easy to explain as "isolcpus=".
Paul, I believe you mentioned a while ago that the tools you guys use for this
aren't open source. Has that changed? If not I'll write my own. I have all
the scripts ready to go but as you pointed out it has to be a standalone
statically linked binary.

Thanx
Max
--
No change.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Ack.
What do you think of that idea btw? ie Generally available "syspart" thing.
Max
--
> What do you think of that idea btw? ie Generally available "syspart" thing.
I have no particular thoughts one way or the other.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
You do not even need to replace /sbin/init for this, no?
Simply installing a custom
   /etc/init.d/create_cpusets
with priority 0
   # chkconfig: 12345 0 99
will do the job.

That script will move init itself into the appropriate cpuset and from then on
everything will inherit it.

Max
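
A create_cpusets script along these lines might be sketched as below. Everything here is an assumption for illustration: the function names, the CPUSET_ROOT override, and the cpuset.cpus / cpuset.mems / tasks file names, which follow the cgroup-style mount used earlier in the thread (a legacy cpuset mount names the files without the "cpuset." prefix). The par0 values come from the syspart example.

```shell
#!/bin/sh
# chkconfig: 12345 0 99
# description: confine init and everything it starts to one cpuset (sketch)
#
# CPUSET_ROOT is a hypothetical override; the cpuset filesystem is assumed
# to be mounted there already.
CPUSET_ROOT=${CPUSET_ROOT:-/dev/cpuset}

# create_cpuset NAME CPUS MEMS: create a child cpuset and assign resources.
create_cpuset() {
    mkdir -p "$CPUSET_ROOT/$1" || return 1
    echo "$2" > "$CPUSET_ROOT/$1/cpuset.cpus" || return 1
    echo "$3" > "$CPUSET_ROOT/$1/cpuset.mems" || return 1
}

# move_task NAME PID: attach one task to the cpuset. Children forked
# afterwards inherit the cpuset of their parent.
move_task() {
    echo "$2" > "$CPUSET_ROOT/$1/tasks"
}

# From the script's "start" action one would then run something like:
#   create_cpuset par0 0-3 0-2 && move_task par0 1
```

Moving PID 1 this way only governs where tasks run and allocate from then on; it does nothing about memory pages that init has already touched.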
--
It won't be the parent of the other init scripts.
-Andi
--
True, but init will be the parent of other init scripts,
and init itself was moved into the appropriate cpuset.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
That can ensure that the daemons that init starts later
on are placed, but it doesn't ensure that the glibc pages that
init (and the shell it spawned to run 'create_cpusets')
touched are placed.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Ah, I missed the memory placement part. As far as cpu placement goes the result
would be equivalent but not for memory.

btw Are you guys using some kind of castrated, statically linked shell to run
that script? Otherwise the regular shell and friends will suck in the same glibc
pages as the regular init would.

Max
--
It (that init= program) is not a script. It is its own
castrated, statically linked special purpose binary.
Once it has done its duty setting up cpusets, it then
exec's the normal init, confined to the cpuset configured
for it.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Got it.
Max
--
The advantage of using a replacement /sbin/init is that you execute
before the rest of userspace, unlike what you propose.
--
That does not matter for the cpu placement (ie the end result is the same) but
does matter for memory placement as PaulJ pointed out.

Thanx
Max
--
Could be ... I wasn't paying close attention to the details.
If so, a good product marketing manager would first upsell the customer
to the better product, and then let falling sales guide the removal
of the old product.

That is, if you can guide most of the users of "isolcpus=" to a better
solution, in -their- view, so that they voluntarily choose to migrate
to the other solution, then you get to deprecate and then remove the
old mechanism.

To the extent that you can show that the old mechanism is costing us
(maintenance, reliability, performance, impeding progress, ...) then
you get to accelerate the deprecation period, even to the point of
an immediate removal of the old feature, if it's of sufficiently little
use and great pain.

We do have one problem with letting "falling sales" guide feature
removal. Unlike Walmart, where they know what has sold where before
the customer has even left the store, we can't easily track usage of
kernel features. Occasionally, we can stir the pot and get some
feedback, as I've done on this thread, if we have a narrow target
audience that we have good reason to believe is especially interested. But that
only works occasionally.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Paul,
We use isolcpus to ensure that boot-time initialization, specifically timer initialization, happens on a specific set of cpus that we won't be using for lower latency purposes. Some of these timers will repeatedly restart themselves on the same cpu and a few do add latency (although admittedly I haven't checked timer latency recently).
Looking at tracebacks in 2.6.26-rc3 from hrtimer_init() and internal_add_timer() things still appear to be working this way, with the timer starting on the originating cpu. If I isolate all but, say one, cpu, timers all seem to start on the unisolated cpu.
A better idea than just removing it.
--
Ah, I know exactly what you're talking about. However this is a non-issue these
days. In order to clear cpuN from all the timers and other things all you need
to do is to bring that cpu off-line
   echo 0 > /sys/devices/system/cpu/cpuN/online
and then bring it back online
   echo 1 > /sys/devices/system/cpu/cpuN/online

There are currently a couple of issues with scheduler domains and hotplug
event handling. I do have the fix for them, and Paul had already acked it.

btw Disabling the scheduler load balancer is not enough. Some timers are started
from the hard- and soft-irq handlers. Which means that you have to also
ensure that those CPUs do not handle any irqs (at least during
Because the same functionality is available via a more flexible mechanism that
is actively supported. isolcpus= is a static mechanism that requires reboots.
cpusets and cpu hotplug let you dynamically repartition the system at any time.

I'd either nuke it or expose it when cpusets are disabled.
In other words
- if cpusets are enabled people should use cpusets to configure cpu resources.
- if cpusets are disabled then we could provide a sysctl (sched_balancer_mask
  for example) that lets us control which cpus are balanced and which aren't.

Max
--
Although it seemed like something of a hack, we experimented with this
previously and found that it didn't work reliably. I'm sure things

Until a proven reliable method for doing this is firmly in place (as
firmly as anything is, anyway), I don't think we should be removing

What sort of conflict are we talking about? I assume once you've begun setting up cpusets that include those cpus that your intention is to change the original behavior.
--
Yes it used to be somewhat unstable. These days it's solid. I'm using it on a
wide range of systems: uTCA Core2Duo, NUMA dual-Opteron, 8way Core2, etc. And
things work as expected.
I forgot to mention that it's not just timers. There are also work queues and
delayed work that have similar side effects (ie they stick to the CPU they
were originally scheduled on). Hotplug cleans all that stuff very nicely.

btw I would not call it a hack. ie Using cpu hotplug for isolation purposes.
By definition hotplug must be able to migrate _everything_ running on the cpuN
when it goes off-line, otherwise it simply won't work. And that's exactly what
we need for the isolation too (migrate everything running on a cpuN to other
That's exactly where the conflict is. Let's say you boot with isolcpus=2 (ie cpu2
is not load balanced), then you add cpu2 along with cpu3 to cpuset N and
enable load balancing in cpuset N. In that case cpu2 will still remain
unbalanced which is definitely wrong behaviour.

Max
--
Max,
I tried the following scenario on an ia64 Altix running 2.6.26-rc4 with cpusets compiled in but cpuset fs unmounted. Do your patches already address this?
$ taskset -cp 3 $$ (attach to cpu 3)
pid 4591's current affinity list: 0-3
pid 4591's new affinity list: 3
$ echo 0 > /sys/devices/system/cpu/cpu2/online (down cpu 2)
(above command hangs)

Backtrace of pid 4591 (bash)
Call Trace:
[<a00000010078e990>] schedule+0x1210/0x13c0
sp=e0000060b6dffc90 bsp=e0000060b6df11e0
[<a00000010078ef60>] schedule_timeout+0x40/0x180
sp=e0000060b6dffce0 bsp=e0000060b6df11b0
[<a00000010078d3e0>] wait_for_common+0x240/0x3c0
sp=e0000060b6dffd10 bsp=e0000060b6df1180
[<a00000010078d760>] wait_for_completion+0x40/0x60
sp=e0000060b6dffd40 bsp=e0000060b6df1160
[<a000000100114ee0>] __stop_machine_run+0x120/0x160
sp=e0000060b6dffd40 bsp=e0000060b6df1120
[<a000000100765ae0>] _cpu_down+0x2a0/0x600
sp=e0000060b6dffd80 bsp=e0000060b6df10c8
[<a000000100765ea0>] cpu_down+0x60/0xa0
sp=e0000060b6dffe20 bsp=e0000060b6df10a0
[<a000000100768090>] store_online+0x50/0xe0
sp=e0000060b6dffe20 bsp=e0000060b6df1070
[<a0000001004f8800>] sysdev_store+0x60/0xa0
sp=e0000060b6dffe20 bsp=e0000060b6df1038
[<a00000010022e370>] sysfs_write_file+0x250/0x300
sp=e0000060b6dffe20 bsp=e0000060b6df0fe0
[<a00000010018a750>] vfs_write+0x1b0/0x300
sp=e0000060b6dffe20 bsp=e0000060b6df0f90
[<a00000010018b350>] sys_write+0x70/0xe0
sp=e0000060b6dffe20 bsp=e0000060b6df0f18
[<a00000010000af80>] ia64_ret_from_syscall+0x0/0x20
sp=e...
The following workaround alleviates the symptom and hopefully is a hint as to the solution:
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
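
If the workaround is only wanted for the duration of the hotplug operation, it could be wrapped so the previous RT throttling setting is restored afterwards. This is a sketch under assumptions: the function name and the two override variables are hypothetical, and the paths are the ones used in this report.

```shell
#!/bin/sh
# Sketch: offline a CPU with RT throttling temporarily disabled, then put
# the old sched_rt_runtime_us value back. Override variables are hypothetical.
RT_RUNTIME=${RT_RUNTIME:-/proc/sys/kernel/sched_rt_runtime_us}
CPU_SYSFS=${CPU_SYSFS:-/sys/devices/system/cpu}

# offline_cpu_unthrottled N: take cpuN offline without RT bandwidth limits.
offline_cpu_unthrottled() {
    saved=$(cat "$RT_RUNTIME") || return 1
    echo -1 > "$RT_RUNTIME" || return 1       # -1 disables RT throttling
    echo 0 > "$CPU_SYSFS/cpu$1/online"
    status=$?
    echo "$saved" > "$RT_RUNTIME"             # restore the previous limit
    return $status
}
```

`offline_cpu_unthrottled 2` corresponds to the "down cpu 2" step in the report above.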
--
Peter, Ingo,
Take a look at the report below (came up during the isolcpus= removal discussions).
It looks like stop_machine threads are getting forcefully preempted because
they exceed their RT quanta. It's strange because the rt period is pretty long.
But given that disabling the rt period logic solves the issue, the machine was not
really stuck.

Max
--
Yeah, I know, I'm already looking at this
--
I see. Does it look like a bug in the rt period logic ?
Or did the stop_machine thread really run for a long time (in the report that
you got that is)?

Max
--
looks like a fun race between refreshing the period and updating
cpu_online_map.
--
Oh, I did not realize that rt period is a timer that iterates online cpus. I
assumed that you do it in the scheduler tick or something.

Max
--
Nope. My patch was a trivial fix for not destroying scheduler domains on
hotplug events. The problem you're seeing is different.

I'm not an expert in cpu hotplug internal machinery, especially on ia64. Recent
kernels (.22 and up) I've tried on x86 and x86-64 have no issues with cpu
hotplug. You probably want to submit a bug report (in a separate thread); maybe
it's a regression in the latest .26-rc series.

Max
--
