Re: [PATCH 6/6] sched: disabled rt-bandwidth by default

From: Peter Zijlstra
Date: Tuesday, August 19, 2008 - 3:33 am

Disable bandwidth control by default.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -824,9 +824,9 @@ static __read_mostly int scheduler_runni
 
 /*
  * part of the period that we allow rt tasks to run in us.
- * default: 0.95s
+ * default: inf
  */
-int sysctl_sched_rt_runtime = 950000;
+int sysctl_sched_rt_runtime = -1;
 
 static inline u64 global_rt_period(void)
 {

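To make the numbers in the patch comment concrete, here is a minimal userspace sketch (not kernel code) of how a runtime/period pair translates into a CPU share for RT tasks. The helper name rt_share_percent is hypothetical; the 1000000 us period is assumed from the 1 second chunking discussed in this thread, and treating a negative runtime as "no limit" mirrors the -1 value in the patch.

```c
/*
 * Illustrative helper (hypothetical name, not kernel code): the share of
 * each period that RT tasks may consume, in percent. A negative runtime
 * is treated as "no limit", mirroring the -1 default in the patch.
 */
double rt_share_percent(long runtime_us, long period_us)
{
	if (runtime_us < 0)
		return 100.0;           /* unlimited: RT tasks may use the whole period */
	if (period_us <= 0)
		return 0.0;             /* defensive: reject a nonsensical period */
	if (runtime_us > period_us)
		runtime_us = period_us; /* cannot run longer than the period itself */
	return 100.0 * (double)runtime_us / (double)period_us;
}
```

With the old default, rt_share_percent(950000, 1000000) gives 95, which leaves 5% of each period to non-RT tasks; with -1 it gives 100.
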
-- 

--

From: Ingo Molnar
Date: Tuesday, August 19, 2008 - 4:05 am

The fixes look good to me, but this enabling of infinite RT task lockups 
is not an improvement.

The thing is, i got far more bugreports about locked up RT tasks where 
the lockup was unintentional, than real bugreports about anyone 
_intending_ for the whole box to come to a grinding halt because a 
high-prio RT task is monopolizing the CPU.

In fact there's only been this artificial test so far.

So could you please just increase the chunking to 10 seconds or so, from 
the current 1 second? Anyone locking up the system for more than 10 
seconds via an RT task has to deal with many other issues already.

I.e. keep the system borderline debuggable (up to 10 seconds delays are 
_not_ nice so people will notice) - but it's still a marked improvement 
from completely locked up desktops.

And those who really need longer than 10 second periods can set it 
higher, or even (if they want to live dangerously or run POSIX 
conformance tests) make it infinite (set it to -1) - and will have to 
deal with other things like the softlockup watchdog as well.

Ok?

	Ingo
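
The tuning described above can be done from a program as well as from a shell. The following is a minimal sketch, assuming the usual procfs names for these sysctls (/proc/sys/kernel/sched_rt_runtime_us and /proc/sys/kernel/sched_rt_period_us); the helper names are hypothetical, and writing requires root.

```c
#include <stdio.h>

/* Assumed procfs paths of the two RT bandwidth sysctls. */
#define RT_RUNTIME_PATH "/proc/sys/kernel/sched_rt_runtime_us"
#define RT_PERIOD_PATH  "/proc/sys/kernel/sched_rt_period_us"

/* Read a single integer from a sysctl file; 0 on success, -1 on error. */
int read_sysctl_long(const char *path, long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%ld", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Write a single integer to a sysctl file; 0 on success, -1 on error. */
int write_sysctl_long(const char *path, long value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = (fprintf(f, "%ld\n", value) > 0);
	/* fclose() flushes the buffer; a rejected value shows up here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Under these assumptions, write_sysctl_long(RT_RUNTIME_PATH, -1) would make the limit infinite, and read_sysctl_long(RT_PERIOD_PATH, &v) would report the current period.
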
--

From: Ingo Molnar
Date: Tuesday, August 19, 2008 - 4:11 am

ok - i've queued the fixes up in tip/sched/rt (not in tip/sched/urgent 
yet, they need a bit of test-time, but are potential v2.6.27 commits) - 
see the shortlog below.

	Ingo

------------------>
Ingo Molnar (1):
      sched: set rt-bandwidth period from 1 second to 10 seconds

Peter Zijlstra (5):
      sched: rt-bandwidth for user grouping interface
      sched: rt-bandwidth accounting fix
      sched: rt-bandwidth group disable fixes
      sched: extract walk_tg_tree()
      sched: rt-bandwidth fixes


 kernel/sched.c    |  215 +++++++++++++++++++++++++++++------------------------
 kernel/sched_rt.c |   16 ++--
 kernel/user.c     |    4 +-
 3 files changed, 129 insertions(+), 106 deletions(-)

--

From: Ingo Molnar
Date: Tuesday, August 19, 2008 - 4:42 am

From fc21334298056c1e0d6428d3abe46b104188a05e Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Tue, 19 Aug 2008 13:40:47 +0200
Subject: [PATCH] sched: extract walk_tg_tree(), fix

fix:

 kernel/sched.c: In function '__rt_schedulable':
 kernel/sched.c:8771: error: implicit declaration of function 'walk_tg_tree'
 kernel/sched.c:8771: error: 'tg_nop' undeclared (first use in this function)
 kernel/sched.c:8771: error: (Each undeclared identifier is reported only once
 kernel/sched.c:8771: error: for each function it appears in.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 59c6683..10f7ad2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1387,7 +1387,7 @@ static inline void dec_cpu_load(struct rq *rq, unsigned long load)
 	update_load_sub(&rq->load, load);
 }
 
-#if (defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)) || defined(SCHED_RT_GROUP_SCHED)
+#if (defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)) || defined(CONFIG_SCHED_RT_GROUP_SCHED)
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 /*

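For readers unfamiliar with the visitor pattern behind the tg_visitor typedef, here is a small userspace analog. It is only a sketch under stated assumptions: struct node, walk_tree, node_nop and sum_weight are hypothetical names, and the recursive walk is an illustration of the idea rather than the kernel's walk_tg_tree() implementation.

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for a task group: a node in a tree. */
struct node {
	struct node *first_child;
	struct node *next_sibling;
	int weight;
};

/* Visitor type, shaped like the tg_visitor typedef in the patch context. */
typedef int (*node_visitor)(struct node *, void *);

/* A visitor that does nothing, in the spirit of tg_nop. */
int node_nop(struct node *n, void *data)
{
	(void)n;
	(void)data;
	return 0;
}

/*
 * Call down() when first entering a node and up() when leaving it.
 * A non-zero return from either visitor aborts the walk and is
 * passed back to the caller.
 */
int walk_tree(struct node *n, node_visitor down, node_visitor up, void *data)
{
	struct node *child;
	int ret;

	ret = down(n, data);
	if (ret)
		return ret;
	for (child = n->first_child; child; child = child->next_sibling) {
		ret = walk_tree(child, down, up, data);
		if (ret)
			return ret;
	}
	return up(n, data);
}

/* Example visitor: add each node's weight to the long pointed to by data. */
int sum_weight(struct node *n, void *data)
{
	*(long *)data += n->weight;
	return 0;
}
```

A caller that only cares about one direction passes the no-op visitor for the other, for example walk_tree(&root, sum_weight, node_nop, &total).
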
--

From: Nick Piggin
Date: Tuesday, August 19, 2008 - 4:17 am

Why are all these people running poorly written apps then?

We don't cater to poorly at the expense of the properly written


Nack. Let's retain our API specifications and backwards compatibility
by default. Advertise the sysrq switch and the setting of the sysctl
to throttle, but don't break this by default please.
--

From: Ingo Molnar
Date: Tuesday, August 19, 2008 - 5:59 am

I agree with you that the 1 second default was a bit too tight - and we 
should definitely change that (and it's changed already).

So changing the "allow RT tasks up to 10 seconds uninterrupted CPU 
monopolization" is OK to me - it still keeps runaway CPU loops (which 
are in the vast majority) debuggable, while allowing common-sense RT 
task usage.

But changing that back to the other extreme: "allow lockups by default" 
is unreasonable IMO - especially in the face of rtlimit that allows 
unprivileged tasks to gain RT privileges.

As an experiment try running a 100% CPU using SCHED_FIFO:99 RT task. It 
does not result in a usable Linux system - it interacts with too many 
normal system activities. It is a very, very special mode of operation 
and anyone using Linux in such a way has to take precautions and has to 
tune things specially anyway. (has to turn off the softlockup watchdog, 
has to make sure IO requests do not time out artificially, etc.) You 
won't even get normal keyboard or console behavior in most cases.

Furthermore, if by "API specifications" you mean POSIX - to get a 
conformant POSIX run one has to change a lot of things on a typical 
Linux system anyway. APIs and utilities have to be crippled to be "POSIX 
compliant".

In other words: we use common sense when thinking about specifications. 
The kernel's defaults are about being reasonable by default.

I have no _strong_ feelings about it, but i don't see the practical value
in going beyond 10 seconds - as it turns a rather useful robustness 
feature off by default (and keeps it untested, etc.).

	Ingo
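
For anyone who wants to try the SCHED_FIFO:99 experiment described above, the following is a minimal sketch of how a thread requests that policy through the POSIX sched_setscheduler() call. The helper names clamp_fifo_priority and become_fifo are hypothetical. The call normally fails with EPERM unless the caller is privileged or has a sufficient RT priority rlimit.

```c
#include <sched.h>
#include <string.h>

/* Clamp a requested priority into the valid range for SCHED_FIFO. */
int clamp_fifo_priority(int prio)
{
	int lo = sched_get_priority_min(SCHED_FIFO);
	int hi = sched_get_priority_max(SCHED_FIFO);

	if (prio < lo)
		return lo;
	if (prio > hi)
		return hi;
	return prio;
}

/*
 * Switch the caller to SCHED_FIFO at the given (clamped) priority.
 * Returns 0 on success, -1 on failure with errno set.
 */
int become_fifo(int prio)
{
	struct sched_param sp;

	memset(&sp, 0, sizeof(sp));
	sp.sched_priority = clamp_fifo_priority(prio);
	return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```

Calling become_fifo(99) and then spinning in a busy loop is the runaway case this thread is about, so it is best tried only on a machine that can be power-cycled.
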
--

From: Max Krasnyansky
Date: Tuesday, August 19, 2008 - 11:15 am

btw The tuning is actually very easy and straightforward ie not so 
special anymore. That's one of the use cases that my cpu isolation work 
was addressing. 2.6.27 will have most of the mechanisms available. All 
the tuning is done by the 'syspart' package:
Same here. I do not mind setting sysctls. At the same time I agree with 
Nick that ideally we should not change the meaning of SCHED_FIFO.

Max
--

From: Nick Piggin
Date: Wednesday, August 20, 2008 - 4:56 am

RT tasks have always been debuggable by using a simple watchdog thread.
As I said before, someone who develops a non-trivial RT app without a
watchdog thread or isolated CPU basically doesn't deserve the honour of
us breaking our API to cater for their idiocy.

But even for those people, we now have the sysrq trigger too. And also
we'll still have the rt throttle sysctl that can be changed at runtime.

There are so many options... "oh but maybe they didn't research the
options either so let's break our APIs instead" is not common sense

No, it's not "allow lockups by default". It is "follow the API and
backwards compatibility by default".

If some distro has gone and given all users RTPRIO rlimit by default
and allowed unprivileged users to lock up the system, it is not the
problem of the upstream kernel. That distro can set the rt throttle
default if it wants to. Or provide a watchdog thread for debugging

This is exactly what *real* RT app/system developers do. I'm not


It's not common sense to change this. It would be perfectly valid to
engineer a realtime process that uses a peak of say 90% of the CPU with
a 10% margin for safety and other services. Now they only have 5%.

Or a realtime app could definitely use the CPU adaptively up to 100% but
still be unable to tolerate an unexpected preemption.

I don't know how you can change this so significantly and be so sure of
yourself that you won't break anything (actually you already have one

I feel strongly about it.

The primary issue is that we have broken the API from both specification
and previous implementation, the answer is yes. That *you* can't see any
reason to use the API in that way kind of pales in comparison with all
due respect. Especially as you already got a counter example of someone's
app that broke.
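
The RTPRIO rlimit mentioned above is exposed to programs as RLIMIT_RTPRIO. As a minimal sketch, a process can check what it has been granted with getrlimit(); the helper name rtprio_soft_limit and its -1/-2 return convention are hypothetical choices made for this example.

```c
#include <sys/resource.h>

/*
 * Return the soft RLIMIT_RTPRIO of the calling process, i.e. the highest
 * real-time priority it may request without extra privilege.
 * Returns -1 if the limit cannot be read and -2 if it is unlimited.
 */
long rtprio_soft_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_RTPRIO, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return -2;
	return (long)rl.rlim_cur;
}
```

A result of 0 means an unprivileged process cannot switch itself to an RT policy at all; a distro that raises this limit for ordinary users is the case being debated here.
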
--

From: Nick Piggin
Date: Tuesday, August 26, 2008 - 2:00 am

So... no reply to this? I'm really wondering how it's OK to break documented
standards and previous Linux behaviour by default for something that it is
trivial to solve in userspace? All the arguments for it IMO are weak, and
the argument against is obviously pretty strong but doesn't seem to have
been acknowledged.

--

From: Ingo Molnar
Date: Tuesday, August 26, 2008 - 2:30 am

I disagree and what do you mean by "trivial to solve in user-space"?

	Ingo
--

From: Nick Piggin
Date: Tuesday, August 26, 2008 - 2:44 am

Disagree with what? That it's a problem to basically break the guarantee

I mean that if some distro has turned on the RT scheduling ulimit by
default and now finds themselves with a local DoS for unprivileged users
as a result, then either that distro should just make their init scripts
set the throttle and break the API themselves, or they should start a
watchdog at a higher priority than an unprivileged user can set.
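
The watchdog approach referred to in this thread can be sketched roughly as follows. This shows only the bookkeeping, under stated assumptions: the struct and function names are hypothetical, timestamps are passed in by the caller (for instance from clock_gettime(CLOCK_MONOTONIC)), and what the watchdog does on expiry (kill, demote, log) is left open.

```c
#include <stdatomic.h>

/* Shared between the RT worker (writer) and the watchdog (reader). */
struct heartbeat {
	atomic_llong last_beat_ns;
};

/* Called by the RT worker each time it completes a unit of work. */
void heartbeat_touch(struct heartbeat *hb, long long now_ns)
{
	atomic_store(&hb->last_beat_ns, now_ns);
}

/*
 * Called periodically by the watchdog thread, which must run at a
 * higher RT priority than the worker so it still gets the CPU when
 * the worker spins. Returns non-zero if the worker looks hung.
 */
int heartbeat_expired(struct heartbeat *hb, long long now_ns,
		      long long timeout_ns)
{
	return now_ns - atomic_load(&hb->last_beat_ns) > timeout_ns;
}
```

The worker calls heartbeat_touch() in its main loop; the watchdog sleeps, wakes, and checks heartbeat_expired() against a timeout chosen for the application.
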
--

From: Ingo Molnar
Date: Tuesday, August 26, 2008 - 3:29 am

I think you are sticking to the rigid letter of some standard without 
seeing the bigger picture.

Firstly, please realize that to do a "successful" POSIX or other 
conformance run a default Linux distribution has to be tweaked and often 
crippled literally dozens and often hundreds of ways. In this case you 
also have to add one more entry to /etc/sysctl.conf, to allow RT tasks 
to monopolize CPU time. So you can still get the POSIX sticker if you 
want to - nothing changed about that.

Secondly, my big picture point is that our task is to make Linux more 
useful and more usable by default. You seem to be arguing that RT tasks 
should be allowed by default to monopolize all CPU time forever, and i 
disagree with that proposition.

But do _you_ actually use such runaway CPU-monopolizing RT tasks? Try it 
one day and you'll quickly meet various practical problems. Let a 
SCHED_FIFO:99 RT task run long enough and on all the main distributions 
you will get:

  BUG: soft lockup - CPU#1 stuck for 61s! [bash:3659]

But monopolizing any resource in a 100% way (which you are arguing for) 
is just not a generic Linux system and for years (seeing all the 
practical problems with it) we tried various methods to contain 
SCHED_FIFO tasks in the scheduler, none was really acceptable for 
mainline.

Peter's changes were clean and useful at last. There's lots of apps that 
use SCHED_FIFO for a short burst of activity, and 100% of the ones i 
know do not want to run for longer than 10 seconds.

Thirdly, your argument can only be consistent if you also argue for the 

... but that's by far not the only usecase. Very frequently i've seen 
bugreports from people with runaway RT tasks (which tasks were running 
as root) where that runaway behavior was completely unintended. Audio 
apps or other apps getting into a loop and locking up the system.

Worse than that, such bugs prevented the system from being debugged by 
plain users. A runaway RT task that monopolizes the CPU will lock it ...
--

From: Nick Piggin
Date: Tuesday, August 26, 2008 - 4:03 am

I'm not talking about anything else except this particular interface.
I'm also not talking about getting a sticker or anything, but providing

Then that's not SCHED_FIFO/SCHED_RR, so just make another scheduling class.
SCHED_FIFO and SCHED_RR can use up all CPU time, but that's why they are
privileged by default. root has always been able to do silly things, that's
nothing new.

It is the easiest thing in the world to have made a new scheduling class

Again, I'm talking about the upstream kernel, and I'm not actually interested
in other bugs or problems because the way to fix things is to solve one bug
at a time and not give up just because there are some other bugs.

Soft lockup message I don't think causes much pain, except it may be useful to
actually panic and do failover with but AFAIKS it is not enabled by default

Actually you can pretty well isolate kernel services and interrupts from one
CPU and run rt tasks on that. But anyway, who are you to impose a magical


And how is that a kernel problem? Should we fix the kernel against

Tell the stupid audio program writers to run a watchdog task if they
are running a non-trivial amount of code with rt sched policy. Like any

Privileged users can break the kernel and kill everyone so easily anyway,

Somebody already reported their app failed with 1s. What makes you
think there are none around that fail with 10s? Changing old existing
userspace APIs can't be done just because a single person (you) can't
think of a counter example.

Especially not when it could equally be done just by introducing a new


No, what's not nice is to subtly change behaviour in a way that's not

I disagree.

And given the amount of dual core CPUs around these days, I suspect you
exaggerate the number of bug reports you get about this too. But anyway
as I said, if you're enabling rt prio ulimit by default in your distro
and then dislike the local DoS it opens up, then why can't you also just
change the rt throttle yourself rather than ...
--

From: Nick Piggin
Date: Tuesday, August 26, 2008 - 2:54 am

Your arguments were along the line of:

* It probably doesn't break anything (except we had somebody report
  that it breaks their app)

* If it does break something then they must be doing something stupid
  (I refuted that because there are several legitimate ways to use rt
  scheduling that is broken by this)

* We have many other APIs and tools that don't conform to posix (why
  is that a reason to break this one?)

* We should break the API to cater for stupid users and distros who
  create local DoS and/or lock up their boxes (except this is trivial
  to solve by setting sysctls or having a watchdog or using sysrq)

So did I miss some really good argument, or do you really think the
above arguments are a good reason to break the API? If the latter,
then we have to just agree to disagree and I'll ask Linus to arbitrate.
OK?


--

From: Thomas Gleixner
Date: Tuesday, August 26, 2008 - 4:09 am

I'm a real-time oldtimer. An application which hogs the CPU for 9.9
seconds with SCHED_FIFO priority is just broken. It's broken beyond
all limits, whether POSIX allows to do that or Linux obeyed the

Simply because we use common sense instead of following every single

For the vast majority of users and RT developers a sane default of
sanity measures is useful and sensible. 

If someone wants to shoot himself in the foot then it's not an
unreasonable request that he needs to disable the safety guards before
pulling the trigger.

Thanks,

	tglx
--

From: Nick Piggin
Date: Tuesday, August 26, 2008 - 4:27 am

Oh with this much handwaving from you old timers I feel much better
about it ;) I bet before the bug report and change to 10s, any
application that hogged the CPU for more than 0.9 seconds was just
broken too, right? But 10s is more than enough for everybody?

I may not be an old timer, but I can say the kernel is just broken
if it deliberately deviates from standards to undocumented behaviour,
and even more so if it changes from working to broken behaviour for
reasons that can be worked around in userspace (eg. running a higher

How is that a brainfart? It is simple, relatively unambiguous, and not
arbitrary. You really say the POSIX specified behaviour is "a brainfart",
but adding an arbitrary 10s throttle "but the process might be preempted
and lose the CPU to a lower priority task if it uses 10s of consecutive

You seriously develop complex rt tasks without having at least a simple

root is allowed to shoot themselves in the foot. root is the safeguard.
--

From: Theodore Tso
Date: Tuesday, August 26, 2008 - 5:50 am

Actually, any real-time application which hogs the CPU at a high
real-time priority for more than one second is probably doing
something broken.  The whole point of high real-time priorities is to
do something really fast, get in and get out.  Usually such routines
are measured in milliseconds or microseconds.

Think about it *this* way --- what would you think of some device
driver which hogged an interrupt for a full second, never mind 10
seconds.  You'd say it was broken, right?  Now consider that a high
real-time priority thread might be running at a higher priority than

We've not followed POSIX before when it hasn't made sense.  For
example, "df" and "du" report their output in kilobytes, instead of 512

We've done things before to make things harder for root; for example
we've restricted what /dev/mem can do.  And root can always lift the
ulimit.

						- Ted
--

From: Stefani Seibold
Date: Tuesday, August 26, 2008 - 6:31 am

Sorry, the world of embedded programming is sometimes stranger than in
theory. Normally it would not happen that a real-time process locks the
CPU for more than 1 sec. But in some circumstances, especially FPGA
initialisation and long term measurements, it is possible that the
real-time process locks the CPU for more than a second, sometimes for more
than 10 sec. If the embedded program has designed it in that way, this

This has nothing to do with POSIX. It is standard real time behaviour.
RT programming is a job like writing device drivers. You must know what
you are doing.

Modifying the scheduler so that a realtime process gives up the CPU
after a given time will certainly break some embedded applications.

Don't think only of desktop or enterprise LINUX boxes; there are many more
embedded LINUX devices on this planet, and not a few of them rely on the
old scheduler behaviour.
 
The LINUX base guideline is simple in that way, that the kernel will

What is coming next? A device driver manager which kills any driver
that uses too much CPU? Or one that throttles or kicks off the responsible
driver if the hardware generates too many interrupts?

Kernel and embedded real-time programmers should know what they are doing.

Stefani


--

From: Theodore Tso
Date: Tuesday, August 26, 2008 - 10:55 am

And if that's true, the embedded program can adjust the ulimit to
change the priority levels as appropriate.  Real-time programming
will always require a bit more configuration, such as what priority
various hard and soft interrupt routines will run at.  This is just

Actually, we have both of these already.  :-)

						- Ted
--

From: Thomas Gleixner
Date: Tuesday, August 26, 2008 - 2:37 pm

Well, we might have a public opinion poll, whether a system is
declared frozen after 1, 10 or 100 seconds. Even a one second
unresponsiveness shows up on the kernel bugzilla and you request that
unlimited unresponsiveness w/o a chance to debug it is the sane
default.

A one second RT CPU hog is just a broken application, nothing
else. Your precious customer use case is simply crap.

Real-time is about determinism and not about the allowance to fuck up
a system at will. If a system failed to prevent the fuckup once then
this is not at all a guarantee that it allows to do that forever.

Especially not in the Open Source space, where developers are still
allowed to use their brain and apply common sense to prevent such a
wreckage and abuse. Still, your not yet specified use case can
continue to do stupid things forever with the simple tweak that it
needs to declare itself broken by turning off the kernel sanity

Right. I appreciate the nitpicking janitor of the most important POSIX
feature: 

"The unlimited right to monopolize the CPU for any given timeframe."

Get your brain together. Just because it worked before and POSIX
allows it is not an argument at all that it is something useful. If
you want to do this you still can do it by resetting the limit.

Your request to enforce that stupid and braindead behaviour on

No, I did not say that. All I said is that giving the normal and
common sense capable user/developer the chance to debug a runaway task
w/o rebooting the system via the power off button is a sensible and
useful default.

Your request to default to a possibly unusable system serves some yet
to be explained higher goal, which is definitely out of the scope of
common sense.

You still did not explain why this behaviour is useful and your
handwaving vs. some (probably closed source) customer application is

Dude, don't tell me how to design and debug a real time system. 

It's not about me, but about the general usability and debuggability
of Linux even in ...
--

From: Andi Kleen
Date: Tuesday, August 26, 2008 - 3:49 pm

That assumes single CPU. With multiple CPUs and not
all hogged the system should be still responsive? 

-Andi
--

From: Nick Piggin
Date: Wednesday, August 27, 2008 - 3:08 am

Right.

But also it assumes desktop/general purpose server thing.

There may not even be any user interface to be unresponsive. Or it
may be something implemented with a userspace driven scheduling
system. Or an event loop in a single process.
--

From: Ingo Molnar
Date: Thursday, August 28, 2008 - 3:54 am

Wrong.

Even if the system has multiple CPUs, and even if just a single CPU is 
fully utilized by an RT task, without the rt-limit the system will still 
lock up in practice due to various other factors: workqueues and tasks 
being 'stuck' on CPUs that host an RT hog. While there's obviously CPU 
time available on other CPUs, you cannot run 'top', the desktop will 
freeze, work flows of the system can be stuck, etc, etc..

With the rt limit in place, it's all pretty smooth and debuggable. Even 
with all CPUs hogged by SCHED_FIFO prio 99 the system is laggy but 
debuggable - the user can run 'top' and can resolve the situation.

Really, this reply of yours shows something startling: that despite this 
many mails you still have never actually tried to run the scenario you 
are complaining about: you have never tried to run a CPU hog high-prio 
RT task on a Linux system before, and you have never observed the 
effects it has on general system stability and debuggability.

This fundamental lack of experience weakens all your arguments and i 
don't even know why you are arguing about it. Do you perhaps have some
customer application/workload you are worried about? If you have then 
please tell us about the exact specifics - this handwaving about 
compliance really makes little sense.

In other words: in our car the air-bag continues to be enabled by 
default, and if someone wants to use the car for stunts the air-bag can 
be disabled via that handy sysctl.

In any case i think i'm going to ignore this thread from now on, nothing 
new has been said really, just the general tone of discussion is 
deteriorating. You are also very late with raising objections in any 
case - the rt-limit feature has been posted 10 months ago and went 
upstream 8 months ago - two full kernel cycles have been completed with 
this change in place and a third one has almost been finished.

        Ingo
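
The effect of the rt limit on a CPU hog can be pictured with a deliberately simplified model; this is a sketch, not the kernel's accounting code. It assumes an RT task that always wants the CPU, gets the first runtime_us of every period_us window, and is throttled for the rest. Both function names are hypothetical, and times are non-negative microseconds.

```c
/*
 * Simplified model: is a continuously runnable RT task throttled at
 * time t_us? A negative runtime_us means throttling is disabled.
 */
int rt_throttled_at(long long t_us, long runtime_us, long period_us)
{
	if (runtime_us < 0 || period_us <= 0)
		return 0;
	return (t_us % period_us) >= runtime_us;
}

/*
 * Time left for non-RT tasks during the first window_us microseconds,
 * under the same model. Returns 0 when throttling is disabled.
 */
long long non_rt_time(long long window_us, long runtime_us, long period_us)
{
	long long full, rem, slack;

	if (runtime_us < 0 || period_us <= 0 || runtime_us >= period_us)
		return 0;
	full = window_us / period_us;   /* number of complete periods */
	rem = window_us % period_us;    /* partial period at the end */
	slack = period_us - runtime_us; /* throttled time per period */
	return full * slack + (rem > runtime_us ? rem - runtime_us : 0);
}
```

With a runtime of 950000 us and a period of 1000000 us, non_rt_time(10000000, 950000, 1000000) is 500000, i.e. half a second out of every ten remains for tools like 'top'.
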
--

From: Andi Kleen
Date: Thursday, August 28, 2008 - 4:09 am

The load balancer will not notice that a particular CPU is busy

I had such a situation at least once in the past (not due to a
runaway RT task but due to a kernel bug) and even with 2 out of 4 CPUs blocked
the system was still quite usable. top/kill definitely worked.  The system 
didn't have a desktop, but I didn't notice many problems in shell use. 
Ok it's just one sample.

That said I don't think having such a limit by default is a bad idea actually.

Just handling it in the scheduler anyways is also probably good because
it can happen even due to other issues than just run away RT tasks.

-Andi
--

From: Peter Zijlstra
Date: Thursday, August 28, 2008 - 4:19 am

Not currently, working on that though.

--

From: Ingo Molnar
Date: Thursday, August 28, 2008 - 4:28 am

yeah, that's nice - i tried the earlier iteration of your patch already. 
It doesn't solve the UP case obviously, nor the case where all CPUs are
hogged by RT tasks, nor any other (or future) per CPU aspect of Linux 
that we have in place currently.

	Ingo
--

From: Andi Kleen
Date: Thursday, August 28, 2008 - 4:50 am

I wonder if it would make sense to break affinities in extreme case?
With that even the workqueues would work again.

-Andi

-- 
ak@linux.intel.com
--

From: Peter Zijlstra
Date: Thursday, August 28, 2008 - 5:00 am

Then people can no longer assume stuff like queue_work_on() etc.. works.
Users of such code might depend on it actually running on the specified
cpu.



--

From: Andi Kleen
Date: Thursday, August 28, 2008 - 5:14 am

If they assume that they're already buggy because CPU hot unplug will break
affinities.

-Andi
--

From: Nick Piggin
Date: Thursday, August 28, 2008 - 5:18 am

It is actually possible (with fairly little work, last time I looked,
maybe it is already integrated in the kernel) to avoid all this kind of
thing from isolated CPUs.

But even then, note that the types of programs using the CPU for long
periods are obviously not going to be run on an average desktop system.
So the responsiveness argument is laughable. Responsive as defined how?
And in relation to what type of systems?
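
Running an RT task on a dedicated CPU, as mentioned at the start of this message, involves pinning the task to that CPU. The following is a minimal sketch using the Linux-specific sched_getaffinity()/sched_setaffinity() calls; the helper names are hypothetical, and keeping kernel threads and interrupts off the chosen CPU is a separate system-configuration step that is not shown.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for cpu_set_t and the CPU_* macros */
#endif
#include <sched.h>

/* Lowest-numbered CPU the caller may currently run on, or -1 on error. */
int first_allowed_cpu(void)
{
	cpu_set_t set;
	int cpu;

	CPU_ZERO(&set);
	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}

/* Restrict the caller to a single CPU; 0 on success, -1 on error. */
int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	if (cpu < 0 || cpu >= CPU_SETSIZE)
		return -1;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}
```

An RT application would typically call pin_to_cpu() before switching to an RT policy, so that a hog on that CPU leaves the others available.
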
--

From: Max Krasnyansky
Date: Thursday, August 28, 2008 - 9:19 am

Please lets not break affinity :).

I'm going to submit patches (soonish) that convert drivers/etc to use 
cancel_work_sync()/flush_work() instead of flush_scheduled_work().
That takes care of the
     "machine getting stuck because workqueue thread is starved"
case.

Max
--

From: Ingo Molnar
Date: Thursday, August 28, 2008 - 9:25 am

correct, breaking affinity is a rather stupid idea.

	Ingo
--

From: Andi Kleen
Date: Thursday, August 28, 2008 - 9:33 am

Ok let's remove cpu hotunplug then.  Probably nobody uses it anyways @)

Seriously cpu affinity on all non BP CPU is currently broken on every
suspend to RAM, doing it in a few more cases when it makes the system
more robust is unlikely to hurt anybody.

-Andi

-- 
ak@linux.intel.com

--

From: Nick Piggin
Date: Thursday, August 28, 2008 - 5:03 am

No, it is right. With caveats. Because you can pretty well isolate a
CPU from running kernel threads or work. At any rate, I don't think it

When I write rt apps, I run a watchdog thread which detects a hang

Of course I have and of course I know what it does if you run a
for (;;) rt thread on an ordinary Linux desktop system. Trying to

You're continually ignoring all of my arguments and instead raising
irrelevant things like this.

You ignored others in this thread who replied with real uses of the
rt scheduling that is being prevented by this API breakage, and
you're ignoring my examples of how it could be used and just keep
asserting that "anybody who does that is broken anyway".

You also ignored when I told you how you can fix this correctly by
introducing new SCHED_xxx scheduling policies that won't break
backwards compatibility and will be defined from the outset to be
throttled as such.

There is no customer issue and there is no handwaving about compliance;
it is a black and white issue: this behaviour breaks all documentation,

How am I supposed to respond to that? My car doesn't have an air bag
but its brakes don't stop working every 10 seconds.


OK, if you don't wish to have further discussion then I will submit a

So what?
--

From: Ingo Molnar
Date: Thursday, August 28, 2008 - 6:07 am

well, the reason i'm asking is that i cannot for anything in the world 
imagine you being so upset about _anything_ but something that involves 
benchmark runs ;-)

And what does SCHED_FIFO RT policy scheduling have to do with 
performance and benchmarks? Nothing usually in the real world, except 
for this little known fact: a common 'tuning' for TPC database 
benchmarks is to run all DB threads as SCHED_FIFO to squeeze the last 
0.1% of performance out of the setup.

So - and i'm taking an educated guess here - is SCHED_FIFO+TPC 
performance perhaps one of the factors that played a role in you 
initiating this thread? If yes then it's obviously an incredibly broken 
use of SCHED_FIFO and we can add the sysctl tuning to the long list of 
dozens of other tunings that happen before a TPC run anyway.

Hm?

	Ingo
--

From: Nick Piggin
Date: Thursday, August 28, 2008 - 6:45 am

;) Well yes as you know I'm not actively doing much scheduler work for
a while now. Luckily there are a lot of really good people who probably
do a better job on it than me anyway, so on the whole I'm quite happy
with it.

But ironically that's also why I hadn't raised my concerns earlier... I
simply was not aware of the change. So I wish I had participated in the

To address this concern: no, it is not tpc ;) Actually I don't know a
thing about tpc except what scant information can basically be
gained on the list (disclaimer: I probably could find out more under
NDA, but I don't care to).

No, there is no customer behind the scenes and nor do I have a use
case myself. I really would have told you about it by now.

I'm concerned because I honestly think there is a risk of breaking
systems. I also think that in this problem space, people often care
about guard bands and worst case scenarios so even if the app does
not do a cpu hogging polling loop or cooperative scheduling or
anything like that, then I think it is risky to add this source of
uncertainty.

The other issue is that the old behaviour (and, dare I say it,
specification) is quite straightforward. At least it is simpler and thus
I guess easier to analyze than this behaviour with the added caveat.

I realise that as Linux gets better at this, people are wanting to use
-rt programs like audio mixing on their desktops and for that kind of
thing, throttling is probably often the desired behaviour. So I can
see why it was implemented. I just think it is a nasty surprise to
have this behaviour by default in the kernel.

I hope I explained myself better now. I was not being too constructive
when I was getting heated.

What I would like to see is maybe a new SCHED_ policy or two which can
be defined basically as rt-with-throttle which some apps could use. I
also think the sysctl to throttle it is a fine idea. And for desktop
installations there is probably a much stronger argument for it. But I
disagree with ...
--

From: Nick Piggin
Date: Thursday, August 28, 2008 - 5:29 am

BTW. this is funny that you just decide you can somehow "weaken"
my technical arguments because of some personal attribute
you believe about me.

You don't know why I am arguing? I'll put it very simply one more
time.

- This behaviour has changed the kernel's userspace API in a way
  that can break existing applications.

That is my primary point. If you think it gets somehow weaker
because you don't think I have ever locked up my workstation with
an RT task, then I give up arguing with you.
--

From: Nick Piggin
Date: Wednesday, August 27, 2008 - 3:04 am

I don't understand the fixation on declaring a system frozen. I repeat:
how do you know "rt task code that hogs the CPU for 10s is broken"? This
still hasn't been adequately explained to me, and from responses to this

What customer use case are you talking about? I never mentioned one and
have none. Are you confusing me with someone else?

But OK, so if someone else has a customer use case that breaks, what
makes you think you can just declare it is crap and we don't care about
it? For that matter, what has closed source got to do with it? We don't


This is just handwaving and ignoring the issue at hand. SCHED_FIFO and
SCHED_RR are exactly about being able to hog the CPU. That is exactly

Huh? Again, I don't have a use case, and even ignoring the several posts
of people who do, I would still make the same argument because it is
plain for me to see that breaking the API by default is the wrong thing

Umm... yeah. That's exactly one of the important properties of SCHED_FIFO


I don't deny that the runaway task thing is a *small* advantage. But

You have it completely backwards. If someone wants to change a userspace API,
it is *they* who must not handwave about why "anybody who wants to do that is
broken anyway so we don't care about them".

I, on the other hand, opposing the API change, sure can handwave or find one
or two counter examples as to why we might have users relying on the old
behaviour.

The replies you got might convince you that your view of the rt world is not
the complete and only picture. But if not, then consider that rt tasks need
not have a fixed amount of work to be done per unit of time but they may
scale work according to the available CPU power. Or it may be something

I didn't tell you, I asked you. Do you develop without a watchdog? Do
you think the majority of RT developers do?

Because if so, then I certainly will tell you to use a watchdog to get
the debuggability you ask for, rather than break the kernel interface
for everyone else.. ...
From: Mark Hounschell
Date: Tuesday, August 26, 2008 - 6:47 am

Well, I've been working on RT hardware (mostly) and software since 1977.
With all due respect, that's crapola. I for one have this requirement and
there is _no_ way around it in my world. In fact it's the kernel that's broken
by stealing precious usecs from me.

From my point of view, as an RT user, any kernel that supports SMP yet can't 

Again that is also crapola. If i want to shoot myself in the foot, it's
none of your concern. I know perfectly well what will happen when 
I pull the trigger. 

My 2 cents
Regards
Mark
--

From: Steven Rostedt
Date: Tuesday, August 26, 2008 - 4:00 pm

I'm sorry, but I need to agree with this. I've been focused more on RT
and in military apps since 1991 (not as long as 77 though :-)

There are two issues here.

1) What FIFO means

2) Protecting the 99% of the users


What most real RT centric folks will want is the true meaning of FIFO.
That is, a FIFO task can run as long as it wants using as much CPU as it
wants until a) a higher RT task preempts it, or b) it voluntarily
releases the CPU.
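
For readers less familiar with the API under discussion, here is a minimal sketch of how a task asks for SCHED_FIFO. The helper names become_fifo() and try_fifo() are made up for illustration; sched_setscheduler() and sched_get_priority_min() are the standard calls. It assumes Linux and will normally fail with EPERM unless the caller has the needed privilege.

```c
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/*
 * Ask the kernel to run the calling thread under SCHED_FIFO at the
 * given priority. Returns 0 on success or -1 with errno set (EPERM is
 * typical for an unprivileged caller).
 */
static int become_fifo(int priority)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &sp);
}

/* Hypothetical helper: try the lowest FIFO priority and report the outcome. */
static int try_fifo(void)
{
    int prio = sched_get_priority_min(SCHED_FIFO);

    if (become_fifo(prio) == -1) {
        fprintf(stderr, "SCHED_FIFO refused: %s\n", strerror(errno));
        return -1;
    }
    /*
     * Under the classic definition, from here on only a higher-priority
     * RT task, a blocking call, or sched_yield() takes the CPU away.
     */
    return 0;
}
```

With throttling enabled, the comment at the end of try_fifo() no longer holds once the RT budget for the period is used up, which is the behaviour change being debated.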

This change, without doubt, breaks the definition of what a FIFO task
is. This is the kernel imposing policy onto userspace.

What Thomas Gleixner and Ingo Molnar are doing, is focusing on 2 above.
(protecting the 99% of users).  This is reasonable, since that's who will
bug them the most when things break.

The problem I have, is that this is breaking a defined user API. A
default that is well known within the RT community. The simple
definition of FIFO.

What I would suggest is this.

1) Keep the default as the infinite for those that know what they are
   doing.

2) Change the sysctl scripts in the distros to set the default to a sane
  time that will protect the users.

An RT app that would break the 10s limit would probably be using busybox
anyway, so the default for that would be what the kernel comes up with.

The default the 99% of users would have, is what the distro set it to
for them.

This seems like a sane solution to satisfy both camps.
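
As a rough sketch of what such a distro-side default amounts to: a boot script (or a `kernel.sched_rt_runtime_us = 950000` line in the distro's sysctl configuration) writes a value into the sysctl file. The helper name set_rt_runtime_us() is hypothetical, and the path is a parameter so that the function can be tried on an ordinary file; on a real system it would be /proc/sys/kernel/sched_rt_runtime_us and the caller would need root.

```c
#include <stdio.h>

/*
 * Write a runtime value in microseconds (or -1 for "no limit") to a
 * sysctl-style file. Returns 0 on success, -1 on failure.
 */
static int set_rt_runtime_us(const char *path, long runtime_us)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    if (fprintf(f, "%ld\n", runtime_us) < 0) {
        fclose(f);
        return -1;
    }
    /* A rejected value may only show up when the buffer is flushed. */
    return fclose(f) == 0 ? 0 : -1;
}
```

An RT deployment that wants the old semantics would call the same helper with -1.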

-- Steve

--

From: Chris Friesen
Date: Wednesday, August 27, 2008 - 11:55 am

Makes sense to me.  It could even get sent out to users about as fast as 
a new kernel by itself, since they could just add a package dependency 
to update the init scripts when the end-user installs the new kernel 
package.

Anyone messing with the kernel directly is likely 1) smart enough to 
deal with existing FIFO semantics, and 2) able to modify their own init 
scripts to get some additional security if they so desire.

Chris
--

From: Steven Rostedt
Date: Thursday, August 28, 2008 - 7:15 am

My biggest concern about adding a limit to FIFO is that an RT developer
would spend weeks trying to debug their system wondering why their
planned CPU RT hog, is being preempted by a non-RT task.

For this, if this time limit does kick in, we should at the very least
print something out to let the user know this happened. After all, this
is more of a safety net anyway, and if we are hitting the limit, the
user should be notified. Perhaps even tell the user that if this
behaviour is expected, to up the sysctl <var> by more.
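
To make the suggestion concrete, here is a toy user-space model of a per-period RT budget that reports the first time it throttles. It is not the kernel's code; the struct, the function names and the message text are all invented for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy user-space model of a per-period RT budget; not kernel code. */
struct rt_budget {
    long runtime_us;   /* RT time allowed per period, -1 means unlimited */
    long period_us;    /* length of one accounting period */
    long used_us;      /* RT time consumed in the current period */
    bool warned;       /* set once the first throttle has been reported */
};

/* Start a new accounting period: the consumed time is forgotten. */
static void rt_budget_new_period(struct rt_budget *b)
{
    b->used_us = 0;
}

/*
 * Charge delta_us of RT execution. Returns true when the budget for
 * this period is exhausted and RT tasks would have to wait. The first
 * time that happens a message is printed, so the throttle is visible.
 */
static bool rt_budget_charge(struct rt_budget *b, long delta_us)
{
    if (b->runtime_us < 0)
        return false;               /* throttling disabled */

    b->used_us += delta_us;
    if (b->used_us < b->runtime_us)
        return false;

    if (!b->warned) {
        fprintf(stderr, "RT throttling activated (limit %ld us per %ld us)\n",
                b->runtime_us, b->period_us);
        b->warned = true;
    }
    return true;
}
```

Printing only once keeps a runaway task from flooding the log while still telling the developer why a planned CPU hog lost the CPU.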

Peter, another question. Is this limit for a single RT task running, or
all RT tasks. I'm assuming here that it is a single RT task. If you have
20 RT tasks all running, would this let non RT tasks in? In that case,
this could be an even bigger issue.

Thanks,

-- Steve

--

From: Ingo Molnar
Date: Thursday, August 28, 2008 - 7:30 am

yeah, agreed, this is a reasonable suggestion. Peter, do you agree?

	Ingo
--

From: Nick Piggin
Date: Thursday, August 28, 2008 - 7:36 am

Seems reasonable. But I still think it should be disabled by default
(it might not get caught in testing for example).
--

From: Steven Rostedt
Date: Thursday, August 28, 2008 - 8:12 am

Perhaps we should default it to 1sec, that way it would be hit more often, 
and educate the users of this new feature.

-- Steve

--

From: Nick Piggin
Date: Thursday, August 28, 2008 - 8:34 am

There is only one sane default, as far as I can see.

Before anybody attacks me again because I haven't got my brain together or
am an annoying standards nitpicker:

I'm very well aware of the consequences of unlimited hogging of the CPU.
And I know exactly why people might want rt throttling. But just think for
a minute about the _negative_ consequences of changing the API and remember that
it is close to the #1 rule of Linux development to not break user API.

And put it this way: the sysctl is right there. Any distro that cares about
this problem will probably find this thread as #1 hit and work out how to
enable the sysctl and break the API if they are happy to do that. On the
flip side, not every application development or deployment is even going to
know about this, and it may not be trivial to catch in testing, so it could
cause failures in the field.
--

From: Steven Rostedt
Date: Thursday, August 28, 2008 - 8:50 am

The issue here is where to place the policy of protecting the user. Is it 
in the kernel, or is it up to the distro.

I've always thought that the policy settings belong in the distro, and the 
kernel should never enforce a policy (by setting this as default, it is 
enforcing a policy, even though an RT user can change it).

I've recently been told that the kernel has, of late, indeed been 
starting to set policies, with protection of memory and such. If this is 
the case, that the kernel is the place to implement policy, then the 
"sane" default belongs there. If the distro is the place to instill 
policy, then that is the place to put the "sane" default.

Basically, I'm not in a position to say where Linux should place the 
default policies (distro or kernel). I've always thought the kernel should 
be bare bones, allowing the distros to do all the policy settings, and 
those that compile and build their own kernels/distros do so at their own 
risks.  But if this is no longer the case, then who am I to argue.

I guess this decision belongs to those above (Linus, Andrew)?

-- Steve

--

From: Linus Torvalds
Date: Thursday, August 28, 2008 - 10:26 am

The kernel has always done a certain amount of "default policy". 

What do you think things like "swappiness" etc are? Or things like 
overcommit settings? They're all policies, and there is always a default 
one. So in that sense the kernel always has - and fundamentally _must_ - 
set some kind of policy.

And the default policy should generally be the one that makes sense for 
most people. Quite frankly, if it's an issue where all normal distros 
would basically be expected to set a value, then that value should _be_ 
the default policy, and none of the normal distros should ever need to 
worry.

Whether this case is one such, I dunno. Quite frankly, I don't think it's 
even _nearly_ important enough to get this kind of noise.

		Linus
--

From: Steven Rostedt
Date: Thursday, August 28, 2008 - 11:04 am

I guess the reason that this is getting so much noise over other default 
policies, is that this default policy is changing a well known definition:
The meaning of FIFO.

By making the default policy limit the time an RT task runs, we have, in 
essence, changed a user API. Applications that expect to be able to run  
uninterrupted by SCHED_OTHER tasks, will now break.

No one is arguing that this new feature is not useful. The argument is, 
should the kernel set the default policy of an old well known scheduling 
policy to something different than what is expected?

Distros set SE Linux on by default, should the kernel do that too?

-- Steve

--

From: Darren Hart
Date: Thursday, August 28, 2008 - 11:10 am

A lot of people I have an immense amount of respect for with vastly differing
opinions.  There was mention of a user poll so I'll share my .000000002 USD
here.

I have accepted in my dealings with real-time that it is a special programming
paradigm.  The developer has much greater control and must exercise it
responsibly.  From this, I have accepted that I can bring my system to its
knees rather easily if I'm not careful.  I agree with Nick and Max that this
default behavior should be preserved.  I like Steven's suggestion of disabling
the throttling in the upstream kernel, and leaving it to the distros to
safeguard the user from themselves should they choose to.  There is already
some precedent for this with the updated default kernel thread priorities and
realtime group and pam limits.conf settings in Red Hat's MRG product.  When
doing real-time application development, I use various mechanisms to ensure
debuggability, and it varies based on what I'm doing and how I access the
machine.  Sometimes I need special watchdog application, sometimes I need to
boost all the kernel threads related to networking or serial consoles and the
respective login apps (ssh, agetty, etc.).  It seems reasonable to consider
this throttling as another _optional_ tool in my debugging toolkit.

--
Darren Hart
--

From: Mark Hounschell
Date: Thursday, August 28, 2008 - 11:16 am

More and more are wanting and now finding the Linux kernel to be more
RT capable. I seem to remember way back you saying it was one thing 
you didn't really care much about one way or the other. That's OK. But, 
you _are_ the man. Put an end to this. Are you going to allow the long
understood meaning of SCHED_FIFO to change in the Linux kernel 
just to protect a few _supposedly_ bad programmers???

Regards
Mark

--

From: Linus Torvalds
Date: Thursday, August 28, 2008 - 11:42 am

The thing is, the reason I dislike RT is that so many people have such 
different understandings of what RT means.

Quite frankly, I think that the people who are complaining (like you) 
think that RT means "hard realtime". You think about literally specialized 
devices.

A lot of _other_ people think that RT means "good audio latency", where it 
really is a lot softer. 

And neither camp seems to ever admit that they are just a small camp, and 
that the other camp exists or is even valid.

And I'm not really interested. Quite frankly, I suspect the "we want to 
run something like pulseaudio with RT priorities" camp is the more common 
one, and in that context I understand limiting SCHED_FIFO sounds perfectly 
understandable.


quite frankly, most programmers aren't "supposedly bad". And if you think 
that the hard-RT "real man" programmers aren't bad, I really have nothing 
to say.

		Linus

--

From: Steven Rostedt
Date: Thursday, August 28, 2008 - 11:53 am

The fact that it actually limits a SCHED_FIFO task group, over a single 
task thread does bother me a little.

But that said, I and others have made our complaints known, and will 
forever be documented in the halls of the Internet abyss. Thus, the 
verdict has been laid. Seems the default shall be something other than 
infinite.

I will now remain silent.

-- Steve

--

From: Mike Galbraith
Date: Friday, August 29, 2008 - 12:56 am

It bothers me some too.  You have to patch/re-compile the kernel if you
need to turn it off and don't have SCHED_DEBUG enabled (not free).

I tripped over this recently while regression testing.  I didn't expect
a gaggle of SCHED_RR tasks to be throttled on an otherwise idle box.
Hitting that perturbed test results in an unexpected manner, and sent me
off on a tangent.

	-Mike

--

From: Peter Zijlstra
Date: Friday, August 29, 2008 - 1:06 am

/proc/sys/kernel/sched_rt_{runtime,period}_us don't require SCHED_DEBUG.
If they are in any way non-functional on SCHED_DEBUG=n then that's a
clear bug.
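
A small sketch for checking what a given box is actually configured with. It reads the two files named above and reports the RT share of each period in percent. The helper names read_long() and rt_share_percent() are hypothetical, and the paths are parameters so the code can be exercised against ordinary files.

```c
#include <stdio.h>

/*
 * Read one integer from a sysctl-style file. Returns 0 on success and
 * stores the value in *out, or -1 if the file cannot be read or parsed.
 */
static int read_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/*
 * Report the share of each period that RT tasks may use, in percent.
 * Returns 100 when runtime is negative (throttling disabled) and -1
 * on error.
 */
static long rt_share_percent(const char *runtime_path, const char *period_path)
{
    long runtime_us, period_us;

    if (read_long(runtime_path, &runtime_us) != 0 ||
        read_long(period_path, &period_us) != 0 || period_us <= 0)
        return -1;
    if (runtime_us < 0)
        return 100;
    return runtime_us * 100 / period_us;
}
```

On a live system the call would be rt_share_percent("/proc/sys/kernel/sched_rt_runtime_us", "/proc/sys/kernel/sched_rt_period_us").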

--

From: Mike Galbraith
Date: Friday, August 29, 2008 - 1:47 am

Gee, you're right.  I guess my eyeballs didn't want to see them without
their friends.

	-Mike

--

From: Stefani Seibold
Date: Thursday, August 28, 2008 - 12:39 pm

I started this discussion last week with an apparent bug in the new CFS.

As it turns out, it was not a bug, it was a feature, a (undocumented?)
feature.

In the world of embedded devices and real time programming it is not a
hard job to compile the kernel right for the desired usage and fix the
startup script to use the desired policy.

Getting back the old behaviour would be nice and in my opinion the right
way, because the new one breaks with POSIX. But I have a working
solution and that is for me what matters.

By the way - RT means not hard real time. Hard-RT is a marketing phrase.
A given combination of OS and hardware must handle a event in a given
time. That's all.

Thanks for the support.

Regards,
Stefani, the hard RT "real woman" programmer ;-)


--

From: Alan Cox
Date: Thursday, August 28, 2008 - 1:53 pm

Is there actually a reason we can't have two forms of SCHED_FIFO. For
hard RT the existing behaviour is a lot more useful and it is hard to see

"real man" programmers stare at the code in Zen contemplation and debug
by powercycling - that's one thing even hard RT processes can't beat.

Alan
--

From: Nick Piggin
Date: Friday, August 29, 2008 - 11:33 pm

There is a difference. You *have* to pick some value for those things.
The settings can't necessarily be called correct or incorrect.

The default rt sched policy is definitely "broken" in that it very clearly
changes our previous behaviour, documentation, and what other systems do.

You could say that "realtime" in general is not really a single accepted
definition, but *SCHED_FIFO* and *SCHED_RR* in particular do have a well
defined, simple, and widely accepted definition that is undeniably changed
by this "policy".

Given that a) we can easily introduce new SCHED_xxx policies to implement
the new behaviour, and b) there are quite a few users of this API in this
thread who are concerned about the change, I think it is wisest just to
revert to our old behaviour.

I thought the rule of thumb is "if in doubt, we don't break user APIs".
It's funny that nobody has really answered any of my points of concern.


That's cause you don't care about rt that much. You do care about back
compatibility though so I thought you'd be more interested. Anyway, I won't
post any more.
--

From: Max Krasnyansky
Date: Thursday, August 28, 2008 - 9:33 am

I cannot believe you guys are still arguing about this and calling each 
other stupid/incompetent/braindead and such (not this particular email 
but all the stuff before) :)

Seems to me like leaving RT throttling disabled by default is a 
reasonable compromise. Several people suggested that and the advantage 
is that it does not change the definition of SCHED_FIFO/RR by default.

I personally do not care that much what the default is. If Fedora, for 
example, starts enabling it by default I'll still have to change it. So 
it's not much different from enabled by default in the kernel.

Max


--

From: John Kacur
Date: Thursday, August 28, 2008 - 10:22 am

I'm rather surprised at this whole conversation. I think it is pretty
simple that.
1. The kernel should not set policy but provide capabilities.
a.) It would be more appropriate for a distro to set the policy - but
even here, the default policy should match the expectation of what
SCHED_FIFO is and standards such as POSIX unless there is a really
really good reason to show why the standard is wrong. (and I haven't
heard it here)
b.) The fact that it is possible to change the settings is an
excellent feature, but that cannot be used as an argument to change
the default settings to something unexpected. Rather, the feature can
be used to change what the standard default is.

2. SCHED_FIFO doesn't have limitations to it, even if the application
programmer can abuse it. That to me seems to be the whole purpose of
SCHED_FIFO - it does let you do things if you have the proper
privileges that a standard kernel protects against, but if the kernel
sets a limitation on it, then it simply isn't SCHED_FIFO anymore, it's
something else. I really dislike this talk about what a good
application programmer should do anyway, I like that we can be
surprised at human creativity and how things can be used in unexpected
ways, so I don't see why that should be throttled. And this argument
about false kernel lock-ups seems bogus to me too.

John
--

From: Peter Zijlstra
Date: Thursday, August 28, 2008 - 9:05 am

No, it's not per task. It's per group (and trivially the !group case is one
group).

All this bandwidth code comes from RT group scheduling. We do that by
assigning a bandwidth to each group so that within that bandwidth each
group can use RT tasks and have them behave like they should.
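
As a worked example of what a per-group bandwidth means in numbers, the sketch below computes how much RT time one group gets inside a window. It assumes the window starts on a period boundary and that the group's RT tasks are always runnable; rt_time_in_window() is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/*
 * RT time available to one group inside a window, given its runtime
 * and period. Whole periods contribute runtime_us each; a trailing
 * partial period contributes at most runtime_us.
 */
static uint64_t rt_time_in_window(uint64_t window_us,
                                  uint64_t runtime_us, uint64_t period_us)
{
    uint64_t full = window_us / period_us;
    uint64_t rest = window_us % period_us;

    return full * runtime_us + (rest < runtime_us ? rest : runtime_us);
}
```

With the 950000 us runtime and 1 s period that this patch replaces, a 10 s window yields 9.5 s of RT time for the whole group, no matter how many RT tasks share it.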

I don't fully agree with the statement that the most important thing for
SCHED_FIFO is to run as long as you want.

The most important thing SCHED_FIFO brings us are deterministic
scheduling rules. And RT group scheduling maintains that determinism by
using a constant bandwidth assignment.

Now the thing that we've been bickering about - bandwidth limits on the
root group, which just fell out of the whole ordeal due to symmetry.

On the one hand, a program that ran deterministic will still run
deterministically at n% (although of course, just like running on less
powerful hardware, you could miss deadlines you previously did not). On
the other hand, people might not expect that.

Having a lower than 100% bandwidth limit by default gives a safer
environment because it avoids total starvation, nor does it take away
determinism [*].

It does however bring the risk of surprising a few folks.

[*] - there is some added jitter due to the throttling logic, and since
the default period might not align nicely with actual deadlines it's not
perfect. An EDF based scheduler with <100% bandwidth caps would do
better.

Other scheduling classes have been mentioned... I've been on the point
of writing SCHED_ISO, a bandwidth throttled SCHED_FIFO that doesn't
require root privileges and comes with say a 10% bandwidth limit.

Doing that should not be too hard - it will just add more code and a
bigger configuration space.



--

From: Steven Rostedt
Date: Thursday, August 28, 2008 - 9:15 am

Does this mean that if I have 100 RT tasks that together want to run for
10 secs, they will only run for 9.5 secs?

This looks like an even bigger issue. Now we don't have one RT FIFO CPU 
hog, we are now hitting 100 RT FIFO tasks that try to get a bunch done in 
10 secs.

-- Steve

--

From: Peter Zijlstra
Date: Thursday, August 28, 2008 - 9:29 am

Yes.

But say you were doing rate monotonic scheduling (as is not uncommonly
done on top of SCHED_FIFO) then you could not get 100% cpu utilisation
anyway, as RMS has a ~69% utility bound.
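
For reference, the ~69% figure is the Liu and Layland utilization bound for rate monotonic scheduling of n independent periodic tasks whose deadlines equal their periods. The bound is sufficient rather than necessary: a task set at or below it is guaranteed schedulable, while one above it may or may not be.

```latex
U(n) = n\left(2^{1/n} - 1\right), \qquad
\lim_{n \to \infty} U(n) = \ln 2 \approx 0.693
```

For n = 1 the bound is 1, for n = 2 it is about 0.828, and it falls towards ln 2 as n grows.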



--
