Disable bandwidth control by default.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched.c | 17 +++++++----------
1 file changed, 7 insertions(+), 10 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -824,9 +824,9 @@ static __read_mostly int scheduler_runni
/*
* part of the period that we allow rt tasks to run in us.
- * default: 0.95s
+ * default: inf
*/
-int sysctl_sched_rt_runtime = 950000;
+int sysctl_sched_rt_runtime = -1;
static inline u64 global_rt_period(void)
{
--
--
The fixes look good to me, but this enabling of infinite RT task lockups is not an improvement.

The thing is, i got far more bugreports about locked up RT tasks where the lockup was unintentional, than real bugreports about anyone _intending_ for the whole box to come to a grinding halt because a high-prio RT task is monopolizing the CPU. In fact there's only been this artificial test so far.

So could you please just increase the chunking to 10 seconds or so, from the current 1 second? Anyone locking up the system for more than 10 seconds via an RT task has to deal with many other issues already.

I.e. keep the system borderline debuggable (delays of up to 10 seconds are _not_ nice so people will notice) - but it's still a marked improvement over completely locked up desktops.

And those who really need longer than 10 second periods can set it higher, or even (if they want to live dangerously or run POSIX conformance tests) make it infinite (set it to -1) - and will have to deal with other things like the softlockup watchdog as well.

Ok?

Ingo
--
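To make the two knobs in this exchange concrete, here is a minimal C sketch of how the period and runtime values translate into a CPU share for RT tasks. It assumes the settings are exposed under /proc as sched_rt_period_us and sched_rt_runtime_us (these paths are not quoted anywhere in the thread), and the helper names read_sysctl_long() and rt_share_percent() are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed procfs locations of the two RT-throttling knobs. */
#define RT_PERIOD_PATH  "/proc/sys/kernel/sched_rt_period_us"
#define RT_RUNTIME_PATH "/proc/sys/kernel/sched_rt_runtime_us"

/* Read one integer from a sysctl-style file; 0 on success, -1 on failure. */
int read_sysctl_long(const char *path, long *out)
{
	FILE *f;
	int ok;

	assert(path != NULL && out != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%ld", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Percentage of each period that RT tasks may consume.
 * A negative runtime (the -1 from the patch) means "no limit", i.e. 100%.
 */
double rt_share_percent(long period_us, long runtime_us)
{
	assert(period_us > 0);
	if (runtime_us < 0 || runtime_us >= period_us)
		return 100.0;
	return 100.0 * (double)runtime_us / (double)period_us;
}
```

With the old default of 950000 us runtime in a 1000000 us period, rt_share_percent() gives 95, leaving 5% for everything else; with -1 it gives 100. On a live system one would feed it the values read from RT_PERIOD_PATH and RT_RUNTIME_PATH.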
ok - i've queued the fixes up in tip/sched/rt (not in tip/sched/urgent
yet, they need a bit of test-time, but are potential v2.6.27 commits) -
see the shortlog below.
Ingo
------------------>
Ingo Molnar (1):
sched: set rt-bandwidth period from 1 second to 10 seconds
Peter Zijlstra (5):
sched: rt-bandwidth for user grouping interface
sched: rt-bandwidth accounting fix
sched: rt-bandwidth group disable fixes
sched: extract walk_tg_tree()
sched: rt-bandwidth fixes
kernel/sched.c | 215 +++++++++++++++++++++++++++++------------------------
kernel/sched_rt.c | 16 ++--
kernel/user.c | 4 +-
3 files changed, 129 insertions(+), 106 deletions(-)
--
From fc21334298056c1e0d6428d3abe46b104188a05e Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Tue, 19 Aug 2008 13:40:47 +0200
Subject: [PATCH] sched: extract walk_tg_tree(), fix

fix:

  kernel/sched.c: In function '__rt_schedulable':
  kernel/sched.c:8771: error: implicit declaration of function 'walk_tg_tree'
  kernel/sched.c:8771: error: 'tg_nop' undeclared (first use in this function)
  kernel/sched.c:8771: error: (Each undeclared identifier is reported only once
  kernel/sched.c:8771: error: for each function it appears in.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 59c6683..10f7ad2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1387,7 +1387,7 @@ static inline void dec_cpu_load(struct rq *rq, unsigned long load)
 	update_load_sub(&rq->load, load);
 }
 
-#if (defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)) || defined(SCHED_RT_GROUP_SCHED)
+#if (defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)) || defined(CONFIG_SCHED_RT_GROUP_SCHED)
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 /*
--
Why are all these people running poorly written apps then? We don't cater to poorly written apps at the expense of the properly written ones.

Nack. Let's retain our API specifications and backwards compatibility by default. Advertise the sysrq switch and the setting of the sysctl to throttle, but don't break this by default please.
--
I agree with you that the 1 second default was a bit too tight - and we should definitely change that (and it's changed already). So changing it to "allow RT tasks up to 10 seconds of uninterrupted CPU monopolization" is OK to me - it still keeps runaway CPU loops (which are in the vast majority) debuggable, while allowing common-sense RT task usage.

But changing that back to the other extreme: "allow lockups by default" is unreasonable IMO - especially in the face of rtlimit that allows unprivileged tasks to gain RT privileges.

As an experiment try running a SCHED_FIFO:99 RT task that uses 100% CPU. It does not result in a usable Linux system - it interacts with too many normal system activities. It is a very, very special mode of operation and anyone using Linux in such a way has to take precautions and has to tune things specially anyway (has to turn off the softlockup watchdog, has to make sure IO requests do not time out artificially, etc.). You won't even get normal keyboard or console behavior in most cases.

Furthermore, if by "API specifications" you mean POSIX - to get a conformant POSIX run one has to change a lot of things on a typical Linux system anyway. APIs and utilities have to be crippled to be "POSIX compliant".

In other words: we use common sense when thinking about specifications. The kernel's defaults are about being reasonable by default.

I have no _strong_ feelings about it, but i don't see the practical value in going beyond 10 seconds - as it turns a rather useful robustness feature off by default (and keeps it untested, etc.).

Ingo
--
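For anyone who wants to set up the experiment described above, the following C sketch shows the two userspace pieces involved: the RTPRIO resource limit that decides whether an unprivileged task may request RT priority at all, and the call that switches the current task to SCHED_FIFO. The function names are hypothetical; the underlying calls (getrlimit() with RLIMIT_RTPRIO, sched_setscheduler(), sched_get_priority_min()/max()) are standard Linux interfaces. The sketch deliberately contains no busy loop.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>
#include <sys/resource.h>

/*
 * Soft RLIMIT_RTPRIO: the highest RT priority an unprivileged process
 * may request. Returns 0 and stores the limit, or -1 on failure.
 */
int rtprio_soft_limit(rlim_t *out)
{
	struct rlimit rl;

	assert(out != NULL);
	if (getrlimit(RLIMIT_RTPRIO, &rl) != 0)
		return -1;
	*out = rl.rlim_cur;
	return 0;
}

/* Clamp a requested priority into the valid SCHED_FIFO range. */
int clamp_fifo_priority(int prio)
{
	int lo = sched_get_priority_min(SCHED_FIFO);
	int hi = sched_get_priority_max(SCHED_FIFO);

	if (prio < lo)
		return lo;
	if (prio > hi)
		return hi;
	return prio;
}

/*
 * Try to switch the calling task to SCHED_FIFO at the given priority.
 * Returns 0 on success or the errno value (typically EPERM when the
 * caller lacks privilege and the rlimit is too low).
 */
int try_set_fifo(int prio)
{
	struct sched_param sp = { .sched_priority = clamp_fifo_priority(prio) };

	if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
		return errno;
	return 0;
}
```

A test program would call try_set_fifo(99) and then spin; doing that on a machine you care about, with throttling disabled, is exactly the lockup scenario being debated, so treat it as something to run only on a scratch box.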
btw The tuning is actually very easy and straightforward ie not so special anymore. That's one of the use cases that my cpu isolation work was addressing. 2.6.27 will have most of the mechanisms available. All the tuning is done by the 'syspart' package: Same here. I do not mind setting sysctls. At the same time I agree with Nick that ideally we should not change the meaning of SCHED_FIFO. Max --
RT tasks have always been debuggable by using a simple watchdog thread. As I said before, someone who develops a non-trivial RT app without a watchdog thread or isolated CPU basically doesn't deserve the honour of us breaking our API to cater for their idiocy.

But even for those people, we now have the sysrq trigger too. And also we'll still have the rt throttle sysctl that can be changed at runtime. There are so many options... "oh but maybe they didn't research the options either so let's break our APIs instead" is not common sense.

No, it's not "allow lockups by default". It is "follow the API and backwards compatibility by default".

If some distro has gone and given all users RTPRIO rlimit by default and allowed unprivileged users to lock up the system, it is not the problem of the upstream kernel. That distro can set the rt throttle default if it wants to. Or provide a watchdog thread for debugging.

This is exactly what *real* RT app/system developers do. I'm not

It's not common sense to change this. It would be perfectly valid to engineer a realtime process that uses a peak of say 90% of the CPU with a 10% margin for safety and other services. Now they only have 5%. Or a realtime app could definitely use the CPU adaptively up to 100% but still be unable to tolerate an unexpected preemption.

I don't know how you can change this so significantly and be so sure of yourself that you won't break anything (actually you already have one

I feel strongly about it. The primary issue is that we have broken the API from both specification and previous implementation, the answer is yes. That *you* can't see any reason to use the API in that way kind of pales in comparison with all due respect. Especially as you already got a counter example of someone's app that broke.
--
So... no reply to this? I'm really wondering how it's OK to break documented standards and previous Linux behaviour by default for something that is trivial to solve in userspace? All the arguments for it IMO are weak, and the argument against is obviously pretty strong but doesn't seem to have been acknowledged.
--
I disagree and what do you mean by "trivial to solve in user-space"? Ingo --
Disagree with what? That it's a problem to basically break the guarantee

I mean that if some distro has turned on the RT scheduling ulimit by default and now finds themselves with a local DoS for unprivileged users as a result, then either that distro should just make their init scripts set the throttle and break the API themselves, or they should start a watchdog at a higher priority than unprivileged users can set.
--
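The watchdog approach mentioned here can be sketched in a few lines of C. This is only the decision logic, under the assumption that the monitored RT task increments a shared heartbeat counter and that a separate watchdog thread, running at a higher SCHED_FIFO priority, wakes up periodically and calls watchdog_hung(). The struct and function names are invented for this illustration; what the watchdog does on a hang (demote or kill the task) is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct watchdog {
	uint64_t last_beat;        /* heartbeat value seen at the last progress */
	uint64_t last_progress_ms; /* time at which the heartbeat last advanced */
	uint64_t timeout_ms;       /* how long without progress counts as a hang */
};

void watchdog_init(struct watchdog *wd, uint64_t beat,
		   uint64_t now_ms, uint64_t timeout_ms)
{
	assert(wd != NULL);
	wd->last_beat = beat;
	wd->last_progress_ms = now_ms;
	wd->timeout_ms = timeout_ms;
}

/*
 * Called periodically by the watchdog thread with the current heartbeat
 * value and the current time. Returns true when the heartbeat has not
 * advanced for at least timeout_ms.
 */
bool watchdog_hung(struct watchdog *wd, uint64_t beat, uint64_t now_ms)
{
	assert(wd != NULL);
	if (beat != wd->last_beat) {
		wd->last_beat = beat;
		wd->last_progress_ms = now_ms;
		return false;
	}
	return now_ms - wd->last_progress_ms >= wd->timeout_ms;
}
```

The important property is the priority relationship: the watchdog only helps if it can preempt the task it monitors, which is why the message above asks for it to run above whatever priority unprivileged users can reach.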
I think you are sticking to the rigid letter of some standard without seeing the bigger picture. Firstly, please realize that to do a "successful" POSIX or other conformance run a default Linux distribution has to be tweaked and often crippled literally dozens and often hundreds of ways. In this case you also have to add one more entry to /etc/sysctl.conf, to allow RT tasks to monopolize CPU time. So you can still get the POSIX sticker if you want to - nothing changed about that. Secondly, my big picture point is that our task is to make Linux more useful and more usable by default. You seem to be arguing that RT tasks should be allowed by default to monopolize all CPU time forever, and i disagree with that proposition. But do _you_ actually use such runaway CPU-monopolizing RT tasks? Try it one day and you'll quickly meet various practical problems. Let a SCHED_FIFO:99 RT task run long enough and on all the main distributions you will get: BUG: soft lockup - CPU#1 stuck for 61s! [bash:3659] But monopolizing any resource in a 100% way (which you are arguing for) is just not a generic Linux system and for years (seeing all the practical problems with it) we tried various methods to contain SCHED_FIFO tasks in the scheduler, none was really acceptable for mainline. Peter's changes were clean and useful at last. There's lots of apps that use SCHED_FIFO for a short burst of activity, and 100% of the ones i know do not want to run for longer than 10 seconds. Thirdly, your argument can only be consistent if you also argue for the ... but that's by far not the only usecase. Very frequently i've seen bugreports from people with runaway RT tasks (which tasks were running as root) where that runaway behavior was completely unintended. Audio apps or other apps getting into a loop and locking up the system. Worse than that, such bugs prevented the system from being debugged by plain users. A runaway RT task that monopolizes the CPU will lock it ...
I'm not talking about anything else except this particular interface. I'm also not talking about getting a sticker or anything, but providing Then that's not SCHED_FIFO/SCHED_RT, so just make another scheduling class. SCHED_FIFO and SCHED_RT can use up all CPU time, but that's why they are privileged by default. root has always been able to do silly things, that's nothing new. It is the easiest thing in the world to have made a new scheduling class Again, I'm talking about the upstream kernel, and I'm not actually interested in other bugs or problems because the way to fix things is to solve one bug at a time and not give up just because there are some other bugs. Soft lockup message I don't think causes much pain, except it may be useful to actually panic and do failover with but AFAIKS it is not enabled by default Actually you can pretty well isolate kernel services and interrupts from one CPU and run rt tasks on that. But anyway, who are you to impose a magical And how is that a kernel problem? Should we fix the kernel against Tell the stupid audio program writers to run a watchdog task if they are running a non-trivial amount of code with rt sched policy. Like any Privileged users can break the kernel and kill everyone so easily anyway, Somebody already reported their app failed with 1s. What makes you think there are none around that fail with 10s? Changing old existing userspace APIs can't be done just because a single person (you) can't think of a counter example. Especially not when it could equally be done just by introducing a new No, what's not nice is to subtly change behaviour in a way that's not I disagree. And given the amount of dual core CPUs around these days, I suspect you exaggerate the number of bug reports you get about this too. But anyway as I said, if you're enabling rt prio ulimit by default in your distro and then dislike the local DoS it opens up, then why can't you also just change the rt throttle yourself rather than ...
Your arguments were along the line of:

* It probably doesn't break anything (except we had somebody report that it breaks their app)
* If it does break something then they must be doing something stupid (I refuted that because there are several legitimate ways to use rt scheduling that is broken by this)
* We have many other APIs and tools that don't conform to posix (why is that a reason to break this one?)
* We should break the API to cater for stupid users and distros who create local DoS and/or lock up their boxes (except this is trivial to solve by setting sysctls or having a watchdog or using sysrq)

So did I miss some really good argument, or do you really think the above arguments are a good reason to break the API? If the latter, then we have to just agree to disagree and I'll ask Linus to arbitrate. OK?
--
I'm a real-time oldtimer. An application which hogs the CPU for 9.9 seconds with SCHED_FIFO priority is just broken. It's broken beyond all limits, whether POSIX allows to do that or Linux obeyed the Simply because we use common sense instead of following every single For the vast majority of users and RT developers a sane default of sanity measures is useful and sensible. If someone wants to shoot himself in the foot then it's not an unreasonable request that he needs to disable the safety guards before pulling the trigger. Thanks, tglx --
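Disabling the safety guard, as described above, amounts to writing -1 into the runtime knob before the application starts. The C sketch below shows one way an application or init helper might do that. It assumes the knob lives at /proc/sys/kernel/sched_rt_runtime_us (a path not quoted in this thread) and that the caller has the privilege to write it; write_sysctl_long() and disable_rt_throttling() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed procfs path of the RT runtime knob. */
#define RT_RUNTIME_PATH "/proc/sys/kernel/sched_rt_runtime_us"

/* Write one integer to a sysctl-style file; 0 on success, -1 on failure. */
int write_sysctl_long(const char *path, long value)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = (fprintf(f, "%ld\n", value) > 0);
	/* fclose() flushes the buffer, so a rejected write surfaces here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Opt out of RT throttling: -1 means "no limit", as in the patch. */
int disable_rt_throttling(void)
{
	return write_sysctl_long(RT_RUNTIME_PATH, -1);
}
```

Checking the return value matters: an unprivileged caller will get -1 back and should not assume that throttling is off.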
Oh with this much handwaving from you old timers I feel much better about it ;) I bet before the bug report and change to 10s, any application that hogged the CPU for more than 0.9 seconds was just broken too, right? But 10s is more than enough for everybody? I may not be an old timer, but I can say the kernel is just broken if it deliberately deviates from standards to undocumented behaviour, and even more so if it changes from working to broken behaviour for reasons that can be worked around in userspace (eg. running a higher How is that a brainfart? It is simple, relatively unambiguous, and not arbitrary. You really say the POSIX specified behaviour is "a brainfart", but adding an arbitrary 10s throttle "but the process might be preempted and lose the CPU to a lower priority task if it uses 10s of consecutive You seriously develop complex rt tasks without having at least a simple root is allowed to shoot themselves in the foot. root is the safeguard. --
Actually, any real-time application which hogs the CPU at a high real-time priority for more than one second is probably doing something broken. The whole point of high real-time priorities is to do something really fast, get in and get out. Usually such routines are measured in milliseconds or microseconds. Think about it *this* way --- what would you think of some device driver which hogged an interrupt for a full second, never mind 10 seconds. You'd say it was broken, right? Now consider that a high real-time priority thread might be running at a higher priority than We've not followed POSIX before when it hasn't made sense. For example, "df" and "du" report its output in kilobytes, instead of 512 We've done things before to make things harder for root; for example we've restricted what /dev/mem can do. And root can always lift the ulimit. - Ted --
Sorry, the world of embedded programming is sometimes stranger than in theory. Normally it would not happen that a real-time process locks the CPU for more than 1 sec. But in some circumstances, especially FPGA initialisation and long-term measurements, it is possible that the real-time process locks the CPU for more than a second, sometimes for more than 10 sec. If the embedded program has been designed that way, this

This has nothing to do with POSIX. It is standard real-time behaviour. RT programming is a job like writing device drivers. You must know what you do. Modifying the scheduler so that a realtime process gives away the CPU after a given time will certainly break some embedded applications. Don't think only of desktop or enterprise LINUX boxes; there are many more embedded LINUX devices on this planet, and not a few of them rely on the old scheduler behaviour.

The LINUX base guideline is simple in that way, that the kernel will

What is coming next? A device driver manager which kills any driver that uses too much CPU? Or one that throttles/kicks off the responsible driver if the hardware generates too many interrupts? Kernel and embedded real-time programmers should know what they do.

Stefani
--
And if that's true, the embedded program can adjust the ulimit to change the priority levels as appropriate. Real-time programming will always require a bit more configuration, such as what priority various hard and soft interrupt routines will run at. This is just

Actually, we have both of these already. :-)

- Ted
--
Well, we might have a public opinion poll, whether a system is declared frozen after 1, 10 or 100 seconds. Even a one second unresponsiveness shows up on the kernel bugzilla and you request that unlimited unresponsiveness w/o a chance to debug it is the sane default.

A one second RT CPU hog is just a broken application, nothing else. Your precious customer use case is simply crap. Real-time is about determinism and not about the allowance to fuck up a system at will.

If a system failed to prevent the fuckup once then this is not at all a guarantee that it allows to do that forever. Especially not in the Open Source space, where developers are still allowed to use their brain and apply common sense to prevent such a wreckage and abuse.

Still, your not yet specified use case can continue to do stupid things forever with the simple tweak that it needs to declare itself broken by turning off the kernel sanity

Right. I appreciate the nitpicking janitor of the most important POSIX feature: "The unlimited right to monopolize the CPU for any given timeframe."

Get your brain together. Just because it worked before and POSIX allows it is not an argument at all that it is something useful. If you want to do this you still can do it by resetting the limit. Your request to enforce that stupid and braindead behaviour on

No, I did not say that. All I said is that giving the normal and common sense capable user/developer the chance to debug a runaway task w/o rebooting the system via the power off button is a sensible and useful default.

Your request to default to a possibly unusable system serves some yet to be explained higher goal, which is definitely out of the scope of common sense. You still did not explain why this behaviour is useful and your handwaving vs. some (probably closed source) customer application is

Dude, don't tell me how to design and debug a real time system. It's not about me, but about the general usability and debuggability of Linux even in ...
That assumes single CPU. With multiple CPUs and not all hogged the system should be still responsive? -Andi --
Right. But also it assumes desktop/general purpose server thing. There may not even be any user interface to be unresponsive. Or it may be something implemented with a userspace driven scheduling system. Or an event loop in a single process. --
Wrong.
Even if the system has multiple CPUs, and even if just a single CPU is
fully utilized by an RT task, without the rt-limit the system will still
lock up in practice due to various other factors: workqueues and tasks
being 'stuck' on CPUs that host an RT hog. While there's obviously CPU
time available on other CPUs, you cannot run 'top', the desktop will
freeze, work flows of the system can be stuck, etc, etc..
With the rt limit in place, it's all pretty smooth and debuggable. Even
with all CPUs hogged by SCHED_FIFO prio 99 the system is laggy but
debuggable - the user can run 'top' and can resolve the situation.
Really, this reply of yours shows something startling: that despite this
many mails you still have never actually tried to run the scenario you
are complaining about: you have never tried to run a CPU hog high-prio
RT task on a Linux system before, and you have never observed the
effects it has on general system stability and debuggability.
This fundamental lack of experience weakens all your arguments and i
don't even know why you are arguing about it. Do you perhaps have some
customer application/workload you are worried about? If you have then
please tell us about the exact specifics - this handwaving about
compliance really makes little sense.
In other words: in our car the air-bag continues to be enabled by
default, and if someone wants to use the car for stunts the air-bag can
be disabled via that handy sysctl.
In any case i think i'm going to ignore this thread from now on, nothing
new has been said really, just the general tone of discussion is
deteriorating. You are also very late with raising objections in any
case - the rt-limit feature has been posted 10 months ago and went
upstream 8 months ago - two full kernel cycles have been completed with
this change in place and a third one has almost been finished.
Ingo
--
The load balancer will not notice that a particular CPU is busy

I had such a situation at least once in the past (not due to a runaway RT task but due to a kernel bug) and even with 2 out of 4 CPUs blocked the system was still quite usable. top/kill definitely worked. The system didn't have a desktop, but I didn't notice many problems in shell use. Ok it's just one sample.

That said I don't think having such a limit by default is a bad idea actually. Just handling it in the scheduler anyways is also probably good because it can happen even due to other issues than just runaway RT tasks.

-Andi
--
Not currently, working on that though. --
yeah, that's nice - i tried the earlier iteration of your patch already. It doesn't solve the UP case obviously, nor the case where all CPUs are hogged by RT tasks, nor any other (or future) per CPU aspect of Linux that we have in place currently.

Ingo
--
I wonder if it would make sense to break affinities in extreme case? With that even the workqueues would work again. -Andi -- ak@linux.intel.com --
Then people can no longer assume stuff like queue_work_on() etc.. works. Users of such code might depend on it actually running on the specified cpu. --
If they assume that they're already buggy because CPU hot unplug will break affinities. -Andi --
It is actually possible (with fairly little work, last time I looked, maybe it is already integrated in the kernel) to avoid all this kind of thing from isolated CPUs. But even then, note that the types of programs using the CPU for long periods are obviously not going to be run on an average desktop system. So the responsiveness argument is laughable. Responsive as defined how? And in relation to what type of systems? --
Please let's not break affinity :).
I'm going to submit patches (soonish) that convert drivers/etc to use
cancel_work_sync()/flush_work() instead of flush_scheduled_work().
That takes care of the
"machine getting stuck because workqueue thread is starved"
case.
Max
--
correct, breaking affinity is a rather stupid idea. Ingo --
Ok let's remove cpu hotunplug then. Probably nobody uses it anyways @) Seriously cpu affinity on all non BP CPU is currently broken on every suspend to RAM, doing it in a few more cases when it makes the system more robust is unlikely to hurt anybody. -Andi -- ak@linux.intel.com --
No, it is right. With caveats. Because you can pretty well isolate a CPU from running kernel threads or work. At any rate, I don't think it

When I write rt apps, I run a watchdog thread which detects a hang

Of course I have and of course I know what it does if you run a for (;;) rt thread on an ordinary Linux desktop system. Trying to

You're continually ignoring all of my arguments and instead raising irrelevant things like this. You ignored others in this thread who replied with real uses of the rt scheduling that is being prevented by this API breakage, and you're ignoring my examples of how it could be used and just keep asserting that "anybody who does that is broken anyway". You also ignored when I told you how you can fix this correctly by introducing new SCHED_xxx scheduling policies that won't break backwards compatibility and will be defined from the outset to be throttled as such.

There is no customer issue and there is no handwaving about compliance; it is a black and white issue: this behaviour breaks all documentation,

How am I supposed to respond to that? My car doesn't have an air bag but its brakes don't stop working every 10 seconds.

OK, if you don't wish to have further discussion then I will submit a

So what?
--
well, the reason i'm asking is that i cannot for anything in the world imagine you being so upset about _anything_ but something that involves benchmark runs ;-) And what does SCHED_FIFO RT policy scheduling have to do with performance and benchmarks? Nothing usually in the real world, except for this little known fact: a common 'tuning' for TPC database benchmarks is to run all DB threads as SCHED_FIFO to squeeze the last 0.1% of performance out of the setup. So - and i'm taking an educated guess here - is SCHED_FIFO+TPC performance perhaps one of the factors that played a role in you initiating this thread? If yes then it's obviously an incredibly broken use of SCHED_FIFO and we can add the sysctl tuning to the long list of dozens of other tunings that happen before a TPC run anyway. Hm? Ingo --
;) Well yes as you know I'm not actively doing much scheduler work for a while now. Luckily there are a lot of really good people who probably do a better job on it than me anyway, so on the whole I'm quite happy with it. But ironically that's also why I hadn't raised my concerns earlier... I simply was not aware of the change. So I wish I had participated in the

To address this concern: no, it is not tpc ;) Actually I don't know a thing about tpc except what scant information can basically be gained on the list (disclaimer: I probably could find out more under NDA, but I don't care to).

No, there is no customer behind the scenes and nor do I have a use case myself. I really would have told you about it by now.

I'm concerned because I honestly think there is a risk of breaking systems. I also think that in this problem space, people often care about guard bands and worst case scenarios so even if the app does not do a cpu hogging polling loop or cooperative scheduling or anything like that, then I think it is risky to add this source of uncertainty.

The other issue is that the old behaviour (and, dare I say it, specification) is quite straightforward. At least it is simpler and thus I guess easier to analyze than this behaviour with the added caveat.

I realise that as Linux gets better at this, people are wanting to use -rt programs like audio mixing on their desktops and for that kind of thing, throttling is probably often the desired behaviour. So I can see why it was implemented. I just think it is a nasty surprise to have this behaviour by default in the kernel.

I hope I explained myself better now. I was not being too constructive when I was getting heated. What I would like to see is maybe a new SCHED_ policy or two which can be defined basically as rt-with-throttle which some apps could use. I also think the sysctl to throttle it is a fine idea. And for desktop installations there is probably a much stronger argument for it. But I disagree with ...
BTW. this is funny that you just decide you can somehow "weaken" my technical arguments because of some personal attribute you believe about me. You don't know why I am arguing? I'll put it very simply one more time.

- This behaviour has changed the kernel's userspace API in a way that can break existing applications.

That is my primary point. If you think it gets somehow weaker because you don't think I have ever locked up my workstation with an RT task, then I give up arguing with you.
--
I don't understand the fixation on declaring a system frozen. I repeat: how do you know "rt task code that hogs the CPU for 10s is broken"? This still hasn't been adequately explained to me, and from responses to this

What customer use case are you talking about? I never mentioned one and have none. Are you confusing me with someone else? But OK, so if someone else has a customer use case that breaks, what makes you think you can just declare it is crap and we don't care about it? For that matter, what has closed source got to do with it? We don't

This is just handwaving and ignoring the issue at hand. SCHED_FIFO and SCHED_RT are exactly about being able to hog the CPU. That is exactly

Huh? Again, I don't have a use case, and even ignoring the several posts of people who do, I would still make the same argument because it is plain for me to see that breaking the API by default is the wrong thing

Umm... yeah. That's exactly one of the important properties of SCHED_FIFO

I don't deny that the runaway task thing is a *small* advantage. But

You have it completely backwards. If someone wants to change a userspace API, it is *they* who must not handwave about why "anybody who wants to do that is broken anyway so we don't care about them". I, on the other hand, opposing the API change, sure can handwave or find one or two counter examples as to why we might have users relying on the old behaviour. The replies you got might convince you that your view of the rt world is not the complete and only picture. But if not, then consider that rt tasks need not have a fixed amount of work to be done per unit of time but they may scale work according to the available CPU power. Or it may be something

I didn't tell you, I asked you. Do you develop without a watchdog? Do you think the majority of RT developers do? Because if so, then I certainly will tell you to use a watchdog to get the debuggability you ask for, rather than break the kernel interface for everyone else.. ...
Well, I've been working on RT hardware (mostly) and software since 1977.

With all due respect, that's crapola. I for one have this requirement and there is _no_ way around it in my world. In fact it's the kernel that's broken by stealing precious usecs from me. From my point of view, as an RT user, any kernel that supports SMP yet can't

Again that is also crapola. If I want to shoot myself in the foot, it's none of your concern. I know perfectly well what will happen when I pull the trigger.

My 2 cents

Regards
Mark
--
I'm sorry, but I need to agree with this. I've been focused more on RT and in military apps since 1991 (not as long as 77 though :-)

There's two issues here.

1) What FIFO means
2) Protecting the 99% of the users

What most real RT centric folks will want is the true meaning of FIFO. That is, a FIFO task can run as long as it wants using as much CPU as it wants until a) a higher RT task preempts it, or b) it voluntarily releases the CPU.

This change, without doubt, breaks the definition of what a FIFO task is. This is the kernel imposing policy onto userspace.

What Thomas Gleixner and Ingo Molnar are doing is focusing on 2 above (protecting the 99% of users). This is reasonable, since that's who will bug them the most when things break.

The problem I have is that this is breaking a defined user API. A default that is well known within the RT community. The simple definition of FIFO.

What I would suggest is this.

1) Keep the default as the infinite for those that know what they are doing.
2) Change the sysctl scripts in the distros to set the default to a sane time that will protect the users.

An RT app that would break the 10s limit would probably be using busybox anyway, so the default for that would be what the kernel comes up with. The default the 99% of users would have is what the distro set it to for them.

This seems like a sane solution to satisfy both camps.

-- Steve
--
Makes sense to me. It could even get sent out to users about as fast as a new kernel by itself, since they could just add a package dependency to update the init scripts when the end-user installs the new kernel package. Anyone messing with the kernel directly is likely 1) smart enough to deal with existing FIFO semantics, and 2) able to modify their own init scripts to get some additional security if they so desire. Chris --
My biggest concern about adding a limit to FIFO is that an RT developer would spend weeks trying to debug their system wondering why their planned CPU RT hog is being preempted by a non-RT task. For this, if this time limit does kick in, we should at the very least print something out to let the user know this happened. After all, this is more of a safety net anyway, and if we are hitting the limit, the user should be notified. Perhaps even tell the user that if this behaviour is expected, to up the sysctl <var> by more. Peter, another question. Is this limit for a single RT task running, or all RT tasks? I'm assuming here that it is a single RT task. If you have 20 RT tasks all running, would this let non RT tasks in? In that case, this could be an even bigger issue. Thanks, -- Steve --
yeah, agreed, this is a reasonable suggestion. Peter, do you agree? Ingo --
Seems reasonable. But I still think it should be disabled by default (it might not get caught in testing for example). --
Perhaps we should default it to 1sec, that way it would be hit more often, and educate the users of this new feature. -- Steve --
There is only one sane default, as far as I can see. Before anybody attacks me again because I haven't got my brain together or am an annoying standards nitpicker: I'm very well aware of the consequences of unlimited hogging of the CPU. And I know exactly why people might want rt throttling. But just think for a minute about the _negative_ consequences of changing the API and remember that it is close to the #1 rule of Linux development not to break user API. And put it this way: the sysctl is right there. Any distro that cares about this problem will probably find this thread as #1 hit and work out how to enable the sysctl and break the API if they are happy to do that. On the flip side, not every application development or deployment is even going to know about this, and it may not be trivial to catch in testing, so it could cause failures in the field. --
The issue here is where to place the policy of protecting the user. Is it in the kernel, or is it up to the distro? I've always thought that the policy settings belong in the distro, and the kernel should never enforce a policy (by setting this as default, it is enforcing a policy, even though an RT user can change it). I've recently been told that the kernel has, of late, indeed been starting to set policies, with protection of memory and such. If this is the case, that the kernel is the place to implement policy, then the "sane" default belongs there. If the distro is the place to instill policy, then that is the place to put the "sane" default. Basically, I'm not in a position to say where Linux should place the default policies (distro or kernel). I've always thought the kernel should be bare bones, allowing the distros to do all the policy settings, and those that compile and build their own kernels/distros do so at their own risk. But if this is no longer the case, then who am I to argue. I guess this decision belongs to those above (Linus, Andrew)? -- Steve --
The kernel has always done a certain amount of "default policy". What do you think things like "swappiness" etc are? Or things like overcommit settings? They're all policies, and there is always a default one. So in that sense the kernel always has - and fundamentally _must_ - set some kind of policy. And the default policy should generally be the one that makes sense for most people. Quite frankly, if it's an issue where all normal distros would basically be expected to set a value, then that value should _be_ the default policy, and none of the normal distros should ever need to worry. Whether this case is one such, I dunno. Quite frankly, I don't think it's even _nearly_ important enough to get this kind of noise. Linus --
I guess the reason that this is getting so much noise over other default policies is that this default policy is changing a well known definition: the meaning of FIFO. By making the default policy limit the time an RT task runs, we have, in essence, changed a user API. Applications that expect to be able to run uninterrupted by SCHED_OTHER tasks will now break. No one is arguing that this new feature is not useful. The argument is, should the kernel set the default policy of an old well known scheduling policy to something different than what is expected? Distros set SELinux on by default, should the kernel do that too? -- Steve --
A lot of people I have an immense amount of respect for with vastly differing opinions. There was mention of a user poll so I'll share my .000000002 USD here. I have accepted in my dealings with real-time that it is a special programming paradigm. The developer has much greater control and must exercise it responsibly. From this, I have accepted that I can bring my system to its knees rather easily if I'm not careful. I agree with Nick and Max that this default behavior should be preserved. I like Steven's suggestion of disabling the throttling in the upstream kernel, and leaving it to the distros to safeguard users from themselves should they choose to. There is already some precedent for this with the updated default kernel thread priorities and realtime group and pam limits.conf settings in Red Hat's MRG product. When doing real-time application development, I use various mechanisms to ensure debuggability, and it varies based on what I'm doing and how I access the machine. Sometimes I need a special watchdog application, sometimes I need to boost all the kernel threads related to networking or serial consoles and the respective login apps (ssh, agetty, etc.). It seems reasonable to consider this throttling as another _optional_ tool in my debugging toolkit. -- Darren Hart --
More and more are wanting and now finding the Linux kernel to be more RT capable. I seem to remember way back you saying it was one thing you didn't really care much about one way or the other. That's OK. But, you _are_ the man. Put an end to this. Are you going to allow the long understood meaning of SCHED_FIFO to change in the Linux kernel just to protect a few _supposedly_ bad programmers??? Regards Mark --
The thing is, the reason I dislike RT is that so many people have such different understandings of what RT means. Quite frankly, I think that the people who are complaining (like you) think that RT means "hard realtime". You think about literally specialized devices. A lot of _other_ people think that RT means "good audio latency", where it really is a lot softer. And neither camp seems to ever admit that they are just a small camp, and that the other camp exists or is even valid. And I'm not really interested. Quite frankly, I suspect the "we want to run something like pulseaudio with RT priorities" camp is the more common one, and in that context limiting SCHED_FIFO sounds perfectly understandable. Quite frankly, most programmers aren't "supposedly bad". And if you think that the hard-RT "real man" programmers aren't bad, I really have nothing to say. Linus --
The fact that it actually limits a SCHED_FIFO task group, over a single task thread does bother me a little. But that said, I and others have made our complaints known, and will forever be documented in the halls of the Internet abyss. Thus, the verdict has been laid. Seems the default shall be something other than infinite. I will now remain silent. -- Steve --
It bothers me some too. You have to patch/re-compile the kernel if you need to turn it off and don't have SCHED_DEBUG enabled (not free). I tripped over this recently while regression testing. I didn't expect a gaggle of SCHED_RR tasks to be throttled on an otherwise idle box. Hitting that perturbed test results in an unexpected manner, and sent me off on a tangent. -Mike --
/proc/sys/kernel/sched_rt_{runtime,period}_us don't require SCHED_DEBUG.
If they are in any way non-functional on SCHED_DEBUG=n then that's a
clear bug.
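
For anyone who wants to check what those two files mean on a given box, here is a minimal sketch in C. It is an illustration only, not kernel code: the helper names read_sysctl_long() and rt_share() are made up for this example, and only the two /proc/sys/kernel/sched_rt_*_us paths come from this thread. A negative runtime is treated as "no limit", matching the -1 in the patch at the top.

```c
#include <stdio.h>

/* Hypothetical helper: read one integer from a procfs/sysctl file.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
static int read_sysctl_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Hypothetical helper: fraction of each period that RT tasks may use.
 * A negative runtime means "unlimited" and maps to 1.0.
 * A non-positive period is rejected with -1.0. */
static double rt_share(long runtime_us, long period_us)
{
    if (period_us <= 0)
        return -1.0;
    if (runtime_us < 0)
        return 1.0;
    return (double)runtime_us / (double)period_us;
}
```

Reading /proc/sys/kernel/sched_rt_runtime_us and /proc/sys/kernel/sched_rt_period_us with read_sysctl_long() and passing the two values to rt_share() would give roughly 0.95 for a runtime of 950000 and a period of 1000000.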
--
Gee, you're right. I guess my eyeballs didn't want to see them without their friends. -Mike --
I started this discussion last week with an apparent bug in the new CFS. As it turns out, it was not a bug, it was a feature, an (undocumented?) feature. In the world of embedded devices and real time programming it is not a hard job to compile the kernel right for the desired usage and fix the startup script to use the desired policy. Getting back the old behaviour would be nice and in my opinion the right way, because the new one breaks with POSIX. But I have a working solution and that is for me what matters. By the way - RT does not mean hard real time. Hard-RT is a marketing phrase. A given combination of OS and hardware must handle an event in a given time. That's all. Thanks for the support. Regards, Stefani, the hard RT "real woman" programmer ;-) --
Is there actually a reason we can't have two forms of SCHED_FIFO? For hard RT the existing behaviour is a lot more useful and it is hard to see "real man" programmers stare at the code in Zen contemplation and debug by powercycling - that's one thing even hard RT processes can't beat. Alan --
There is a difference. You *have* to pick some value for those things. The settings can't necessarily be called correct or incorrect. The default rt sched policy is definitely "broken" in that it very clearly changes our previous behaviour, documentation, and what other systems do. You could say that "realtime" in general is not really a single accepted definition, but *SCHED_FIFO* and *SCHED_RR* in particular do have a well defined, simple, and widely accepted definition that is undeniably changed by this "policy". Given that a) we can easily introduce new SCHED_xxx policies to implement the new behaviour, and b) there are quite a few users of this API in this thread who are concerned about the change, I think it is wisest just to revert to our old behaviour. I thought the rule of thumb is "if in doubt, we don't break user APIs". It's funny that nobody has really answered any of my points of concern. That's cause you don't care about rt that much. You do care about back compatibility though so I thought you'd be more interested. Anyway, I won't post any more. --
I cannot believe you guys are still arguing about this and calling each other stupid/incompetent/braindead and such (not this particular email but all the stuff before) :) Seems to me like leaving RT throttling disabled by default is a reasonable compromise. Several people suggested that and the advantage is that it does not change the definition of SCHED_FIFO/RR by default. I personally do not care that much what the default is. If Fedora, for example, starts enabling it by default I'll still have to change it. So it's not much different from enabled by default in the kernel. Max --
I'm rather surprised at this whole conversation. I think it is pretty simple. 1. The kernel should not set policy but provide capabilities. a.) It would be more appropriate for a distro to set the policy - but even here, the default policy should match the expectation of what SCHED_FIFO is and standards such as POSIX unless there is a really really good reason to show why the standard is wrong. (and I haven't heard it here) b.) The fact that it is possible to change the settings is an excellent feature, but that cannot be used as an argument to change the default settings to something unexpected. Rather, the feature can be used to change what the standard default is. 2. SCHED_FIFO doesn't have limitations to it, even if the application programmer can abuse it. That to me seems to be the whole purpose of SCHED_FIFO - it does let you do things if you have the proper privileges that a standard kernel protects against, but if the kernel sets a limitation on it, then it simply isn't SCHED_FIFO anymore, it's something else. I really dislike this talk about what a good application programmer should do anyway, I like that we can be surprised at human creativity and how things can be used in unexpected ways, so I don't see why that should be throttled. And this argument about false kernel lock-ups seems bogus to me too. John --
No, it's not per task. It's per group (and trivially the !group case is one group). All this bandwidth code comes from RT group scheduling. We do that by assigning a bandwidth to each group so that within that bandwidth each group can use RT tasks and have them behave like they should. I don't fully agree with the statement that the most important thing for SCHED_FIFO is to run as long as you want. The most important thing SCHED_FIFO brings us is deterministic scheduling rules. And RT group scheduling maintains that determinism by using a constant bandwidth assignment. Now the thing that we've been bickering about - bandwidth limits on the root group, which just fell out of the whole ordeal due to symmetry. On the one hand, a program that ran deterministically will still run deterministically at n% (although of course, just like running on less powerful hardware, you could miss deadlines you previously did not). On the other hand, people might not expect that. Having a lower than 100% bandwidth limit by default gives a safer environment because it avoids total starvation, nor does it take away determinism [*]. It does however bring the risk of surprising a few folks. [*] - there is some added jitter due to the throttling logic, and since the default period might not align nicely with actual deadlines it's not perfect. An EDF based scheduler with <100% bandwidth caps would do better. Other scheduling classes have been mentioned... I've been on the point of writing SCHED_ISO, a bandwidth throttled SCHED_FIFO that doesn't require root privileges and comes with say a 10% bandwidth limit. Doing that should not be too hard - it will just add more code and a bigger configuration space. --
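
To make the per-group accounting concrete, here is a toy user-space model in C. It is a sketch under simplifying assumptions (one CPU, one period, no jitter) and not the kernel's implementation; the struct and function names are invented for this example. The point it illustrates is that all RT tasks in a group draw from one shared budget per period, rather than each task getting its own.

```c
#include <stddef.h>

/* Toy model of one RT group's bandwidth; all names are hypothetical. */
struct rt_group_budget {
    long runtime_us;  /* RT time allowed per period; negative = unlimited */
    long period_us;   /* length of one accounting period */
};

/* Sum the per-task demand within one period and return how much RT time
 * the group as a whole may run before it would be throttled. */
static long rt_time_granted(const struct rt_group_budget *b,
                            const long *demand_us, size_t ntasks)
{
    long total = 0;
    size_t i;

    for (i = 0; i < ntasks; i++)
        total += demand_us[i];

    if (total > b->period_us)
        total = b->period_us;   /* single-CPU model: at most one period of wall time */
    if (b->runtime_us >= 0 && total > b->runtime_us)
        total = b->runtime_us;  /* the cap applies to the group, not to each task */
    return total;
}
```

In this model, with a runtime of 9500000 us and a period of 10000000 us, 100 tasks that each want 100000 us are granted 9500000 us in total, and the remaining 500000 us of the period is left for non-RT tasks.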
Does this mean, if I have 100 RT tasks that will together run for 10 secs, they will only run for 9.5 secs? This looks like an even bigger issue. Now we don't have one RT FIFO CPU hog, we are now hitting 100 RT FIFO tasks that try to get a bunch done in 10 secs. -- Steve --
Yes. But say you were doing rate monotonic scheduling (as is not uncommonly done on top of SCHED_FIFO) then you could not get 100% cpu utilisation anyway, as RMS has a ~69% utilisation bound. --
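
The ~69% figure is the limit of the classic Liu and Layland bound for rate monotonic scheduling, U(n) = n * (2^(1/n) - 1), as the number of tasks n grows. A small C sketch of that formula follows; the function names are made up for this example, and the n-th root is found by bisection only so that the snippet needs no math library.

```c
/* n-th root of 2 by bisection on [1, 2]; avoids depending on libm.
 * Intended for small n (tens to hundreds of tasks). */
static double nth_root_of_two(unsigned n)
{
    double lo = 1.0, hi = 2.0;
    int i;

    for (i = 0; i < 60; i++) {
        double mid = (lo + hi) / 2.0;
        double p = 1.0;
        unsigned k;

        for (k = 0; k < n; k++)
            p *= mid;           /* p = mid^n */
        if (p < 2.0)
            lo = mid;
        else
            hi = mid;
    }
    return (lo + hi) / 2.0;
}

/* Liu and Layland utilisation bound for n periodic tasks under RMS:
 * U(n) = n * (2^(1/n) - 1). Returns 0.0 for n == 0. */
static double rms_bound(unsigned n)
{
    if (n == 0)
        return 0.0;
    return n * (nth_root_of_two(n) - 1.0);
}
```

rms_bound(1) is 1.0 to within rounding, rms_bound(2) is about 0.828, and the value falls towards ln 2 (about 0.693) as n increases, which is where the ~69% comes from.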
