Re: [RFC PATCH 00/11] sched: CFS low-latency features

Previous thread: [RFC PATCH 04/11] sched: debug cleanup place entity by Mathieu Desnoyers on Thursday, August 26, 2010 - 11:09 am. (1 message)

Next thread: [PATCH] drivers/net/: ll_temac_main.c & ll_temac.h: Adding ethtool interface by Ville Sundell on Thursday, August 26, 2010 - 11:13 am. (2 messages)
From: Mathieu Desnoyers
Date: Thursday, August 26, 2010 - 11:09 am

[Empty message]
From: Peter Zijlstra
Date: Thursday, August 26, 2010 - 11:57 am

So we have the following components to this patch:

  - dynamic min_vruntime -- push the min_vruntime ahead at the
       rate of the runqueue wide virtual clock. This approximates
       the virtual clock, esp. when turning off sleeper fairness.
       And is cheaper than actually computing the virtual clock.

       It allows for better insertion and re-weighting behaviour,
       but it does increase overhead somewhat.

  - special wakeups using the next-buddy to get scheduled 'soon',
       used by all wakeups from the input system and timers.

  - special fork semantics related to those special wakeups.


So while I would love to simply compute the virtual clock, it would add
a s64 mult to every enqueue/dequeue and a s64 div to each
enqueue/re-weight, which might be somewhat prohibitive, the dyn
min_vruntime approximation seems to work well enough and costs a u32 div
per enqueue.

Adding a preference to all user generated wakeups (input) and
propagating that state along the wakeup chain seems to make sense,
adding the same to all timers is something that needs to be discussed, I
can well imagine not all timers are equally important -- do we want to
extend the timer interface?

If we do decide we want both, we should at the very least merge the
try_to_wake_up() conditional blob (they're really identical). Preferably
we should reduce ttwu(), not add more to it... 

Fudging fork seems dubious at best, it seems generated by the use of
timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
broken thing to do, it has very ill defined semantics and is utterly
unable to properly cope with error cases. Furthermore its trivial to
actually correctly implement the desired behaviour, so I'm really
skeptical on this front; friends don't let friends use SIGEV_THREAD.


--

From: Thomas Gleixner
Date: Thursday, August 26, 2010 - 2:25 pm

SIGEV_THREAD is the best proof that the whole posix timer interface
was comitte[e]d under the influence of not to be revealed
mind-altering substances.

I completely object to add timer specific wakeup magic and support for
braindead fork orgies to the kernel proper. All that mess can be fixed
in user space by using sensible functionality.

Providing support for misdesigned crap just for POSIX compliance
reasons and to make some of the blind abusers of that very same crap
happy would be a completely stupid decision.

In fact that would make a brilliant precedence case for forcing the
kernel to solve user space madness at the expense of kernel
complexity. If we follow down that road we get requests for extra
functionality for AIO, networking and whatever in a split second with
no real good reason to reject them anymore.

Thanks,

	tglx
--

From: Thomas Gleixner
Date: Thursday, August 26, 2010 - 3:22 pm

I really risked eye cancer and digged into the glibc code.

	/* There is not much we can do if the allocation fails.  */
	(void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);

So if the helper thread which gets the signal fails to create the
thread then everything is toast.

What about fixing the f*cked up glibc implementation in the first place
instead of fiddling in the kernel to support this utter madness? 

WTF can't the damned delivery thread not be created when timer_create
is called and the signal be delivered to that very thread directly via
SIGEV_THREAD_ID ?

Thanks,

	tglx
--

From: Mathieu Desnoyers
Date: Thursday, August 26, 2010 - 4:09 pm

Yeah, that sounds exactly like what I proposed about an hour ago on IRC ;) I'm
pretty sure that would work.

The only thing we might have to be careful about is what happens if the timer
re-fires before the thread completes its execution. We might want to let the
signal handler detect these overruns somehow.

Thanks,


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Mathieu Desnoyers
Date: Thursday, August 26, 2010 - 4:36 pm

Hrm, thinking about it a little more, one of the "plus" sides of these
SIGEV_THREAD timers is that a single timer can fork threads that will run on
many cores on a multi-core system. If we go for preallocation of a single
thread, we lose that. Maybe we could think of a way to preallocate a thread pool
instead ?

Thanks,


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Peter Zijlstra
Date: Friday, August 27, 2010 - 12:38 am

Why try and fix a broken thing, just let the app spawn threads and use
SIGEV_THREAD_ID itself, it knows its requirements and can do what suits
the situation best. No need to try and patch up braindead posix stuff.


--

From: Mathieu Desnoyers
Date: Friday, August 27, 2010 - 8:23 am

As I dig through the code and learn more about the posix standard, my
understanding is that the _implementation_ is broken. The standard just leaves
enough rope for the implementation to either do the right thing or to hang
itself. Sadly glibc seems to have chosen the second option.

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Thomas Gleixner
Date: Friday, August 27, 2010 - 1:43 am

Why should a single timer fork many threads? Just because a previous
thread did not complete before the timer fires again? That's
braindamage as all threads call the same function which then needs to
be serialized anyway. We really do not need a function which creates
tons of threads which get all stuck on the same resource.

Thanks,

	tglx
--

From: Mathieu Desnoyers
Date: Friday, August 27, 2010 - 8:50 am

It could make sense if the workload is mostly CPU-bound and there is only a very
short critical section shared between the threads. But I agree that in many
cases this will generate an utter contention mess.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Peter Zijlstra
Date: Friday, August 27, 2010 - 12:37 am

Simply don't use SIGEV_THREAD and spawn you own thread and use
SIGEV_THREAD_ID yourself, the programmer knows the semantics and knows
if he cares about overlapping timers etc.


--

From: Mathieu Desnoyers
Date: Friday, August 27, 2010 - 8:21 am

From man timer_create:

       SIGEV_THREAD
              Upon  timer  expiration,  invoke  sigev_notify_function as if it
              were the start function of a new thread.  (Among the implementa‐
              tion  possibilities  here are that each timer notification could
              result in the creation of a new thread, or that a single  thread
              is  created  to  receive  all  notifications.)   The function is
              invoked   with   sigev_value   as   its   sole   argument.    If
              sigev_notify_attributes  is  not  NULL,  it  should  point  to a
              pthread_attr_t structure that defines  attributes  for  the  new
              thread (see pthread_attr_init(3).

So basically, it's the glibc implementation that is broken, not the standard.

The programmer should expect that thread execution can overlap though.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Peter Zijlstra
Date: Friday, August 27, 2010 - 8:41 am

The standard is broken too, what context will the new thread inherit?
The pthread_attr_t stuff tries to cover some of that, but pthread_attr_t
doesn't cover all inherited task attributes, and allows for some very
'interesting' bugs [1].

The specification also doesn't cover the case where the handler takes
more time to execute than the timer interval.

[1] - consider the case where pthread_attr_t includes the stack and we
use a spawn thread on expire policy and then run into the situation
where the handler is delayed past the next expiration.
--

From: Mathieu Desnoyers
Date: Friday, August 27, 2010 - 9:09 am

Besides pthread_attr_t, thinking of the scheduler/cgroups/etc stuff, I'd think
it might be expected to inherit the state of the thread which calls
timer_create(). But this is not what glibc does right now, and it is not spelled


Why should it ? It seems valid for a workload to result in spawning many threads
bound to more than a single core on a multi-core system. So concurrency

Setting a thread stack and generating the signal more than once is taken into
account in the standard. It leads to unspecified result (IOW: don't do this):

http://www.opengroup.org/onlinepubs/009695399/functions/timer_create.html

"If evp->sigev_sigev_notify is SIGEV_THREAD and sev->sigev_notify_attributes is
not NULL, if the attribute pointed to by sev->sigev_notify_attributes has a
thread stack address specified by a call to pthread_attr_setstack() or
pthread_attr_setstackaddr(), the results are unspecified if the signal is
generated more than once."

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Peter Zijlstra
Date: Friday, August 27, 2010 - 10:27 am

The problem with allowing concurrency is that the moment you want to do
that you get the spawner context and error propagation problems.

Thus we're limited to spawning a single thread at timer_create() and
handling expirations in there. At that point you have to specify what
happens to an expiration while the handler is running, will it queue
handlers or will you have to carefully craft your handler using
timer_getoverrun().

But I really don't understand why POSIX would provide this composite
functionality instead of providing the separate bits to implement this,
which I think is only SIGEV_THREAD_ID which is missing.


--

From: Mathieu Desnoyers
Date: Friday, August 27, 2010 - 11:32 am

These are problems you get only when you allow spawning any number of threads.
If, instead, you create a thread pool at timer creation, then you can allow

With my statement above in mind, it's true that we'd have to handle what happens
when the thread pool is exhausted. But couldn't we simply increment an overrun
counter from userland ? (It might be a different counter than kernel space kept

Higher abstraction API vs low-level control. A long running battle. ;)

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Peter Zijlstra
Date: Friday, August 27, 2010 - 12:23 pm

That would be a massive resource waste for the normal case where the
interval > handler runtime.


--

From: Mathieu Desnoyers
Date: Friday, August 27, 2010 - 12:57 pm

Not if the application can configure this value. So the "normal" case could be
set to 1 single thread. Only resource-intensive apps would set this differently,
e.g. to the detected number of cpus.

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Mathieu Desnoyers
Date: Tuesday, August 31, 2010 - 8:02 am

.. but as we discussed in private, this would involve adding extra attribute
fields to what the standard proposes. I think the problem comes from the fact
that POSIX considers the pthread attributes to contain every possible thread
attribute one can imagine, which does not take into account the internal kernel
state set by other interfaces.

So this leaves us with a single-threaded SIGEV_THREAD, which is pretty much
useless. But at least it is not totally impossible for glibc to get it right by
moving to an implementation which uses only one thread.

So until one finds the time to work on fixing glibc SIGEV_THREAD timers, I would
strongly recommend against using it.

Thanks,


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Paul E. McKenney
Date: Thursday, August 26, 2010 - 4:18 pm

C'mon, Thomas!!!  That is entirely too sensible!!!  ;-)

But if you are going to create the thread at timer_create() time,
why not just have the new thread block for the desired duration?

							Thanx, Paul
--

From: Mathieu Desnoyers
Date: Thursday, August 26, 2010 - 4:28 pm

The timer infrastructure allows things like periodic timer which restarts when
it fires, detection of missed timer events, etc. If you try doing this in a
userland thread context with a simple sleep, then your period becomes however
long you sleep for _and_ the thread execution time. This is all in all quite
different from the timer semantic.

Thanks,


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Paul E. McKenney
Date: Thursday, August 26, 2010 - 4:38 pm

Hmmm...  Why couldn't the thread in question set the next sleep time based
on the timer period?  Yes, if the function ran for longer than the period,
there would be a delay, but the POSIX semantics allow such a delay, right?

							Thanx, Paul
--

From: Mathieu Desnoyers
Date: Thursday, August 26, 2010 - 4:53 pm

I'm afraid you'll have a large error accumulation over time, and getting the
precise picture of how much time between now and where the the period end is
expected to be is kind of hard to do precisely from user-space. In a few words,
this solution would be terrible for jitter. This is why we usually rely on
timers rather than delays in these periodic workloads.

Thanks,


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Paul E. McKenney
Date: Thursday, August 26, 2010 - 5:09 pm

Why couldn't the timer_create() call record the start time, and then
compute the sleeps from that time?  So if timer_create() executed at
time t=100 and the period is 5, upon awakening and completing the first
invocation of the function in question, the thread does a sleep calculated
to wake at t=110.

							Thanx, Paul
--

From: Mathieu Desnoyers
Date: Friday, August 27, 2010 - 8:18 am

Let's focus on the userspace thread execution, right between the samping of the
current time and the call to sleep:

  Thread A
  current_time = read current time();
  sleep(period_end - current_time);

If the thread is preempted between these two operations, then we end up sleeping
for longer than what is needed. This kind of imprecision will add up over time,
so that after e.g. one day, instead of having the expected number of timer
executions, we'll have less than that. This kind of accumulated drift is an
unwanted side-effect of using delays in lieue of real periodic timers.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Thomas Gleixner
Date: Friday, August 27, 2010 - 8:20 am

Nonsense, that's why we provide clock_nanosleep(ABSTIME)

Thanks,

	tglx
--

From: Mathieu Desnoyers
Date: Friday, August 27, 2010 - 8:30 am

If we're using CLOCK_MONOTONIC, you're right, this could work. I was only
thinking of relative delays.

So do you think Paul's ideal would be a good candidate for the timer_create
SIGEV_THREAD glibc implementation then ?

Thanks,


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Peter Zijlstra
Date: Friday, August 27, 2010 - 8:41 am

Simply do not _ever_ use SIGEV_THREAD, there really is no reason to.


--

From: Mathieu Desnoyers
Date: Thursday, August 26, 2010 - 4:49 pm

Yep, it's cheap enough and seems to work very well as far as my testing have

Yes, this is what lets us kill FAIR_SLEEPERS (and thus let the dynamic

I just thought it made sense that when a timer fires and wakes up a thread,
there are pretty good chances that we might to wakeup this thread quickly. But
it brings the question in a more general sense: would we want this kind of
behavior also available for network packets, disk I/O, etc ? IOW, would it make
sense to have next-buddy selection on all these input-triggered wakeups ? Since
we're only using the next buddy selection, it will only perform this selection
if the selected buddy is within a specific vruntime range from the minimum, so
AFAIK, I don't think we would end up starving the system in any possible way.

So far I cannot see a situation where selecting the next buddy would _not_ make
sense in any kind of input-driven wakeups (interactive, timer, disk, network,

Sure. Or maybe find out we want to target even more input paths... so far
interactive input seemed absolutely needed, timers seemed logical and
self-rate-limited. For the others, I don't know. It might actually make sense

Agreed for the timer-tied case. I think the direction Thomas proposes for fixing
glibc makes much more sense. However, I am wondering about the interactive case
too: e.g., if you click on "open terminal", it needs to fork a process and you
would normally expect this to come up more quickly than the background kernel
compile you're doing. So there might be some interest for this fork vruntime
boost for interactivity-driven wakeups. Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Peter Zijlstra
Date: Friday, August 27, 2010 - 12:42 am

The risk is that you end up with always using next-buddy, and we tried
that a while back and that didn't work well for some, Mike might
remember.

Also, when you use timers things like time-outs you really couldn't care
less if its handled sooner rather than later.

Disk is usually so slow you really don't want to consider it
interactive, but then sometimes you might,.. its a really hard problem.

The only clear situation is the direct input, that's a direct link
between the user and our wakeup chain and the user is always important.


--

From: Mike Galbraith
Date: Friday, August 27, 2010 - 1:19 am

I turned it off because it was ripping spread apart badly, and last


Yeah, directly linked wakeups using next could be a good thing, but the
trouble with using any linkage to the user is that you have to pass it
on to reap benefit.. so when do you disconnect?

	-Mike

--

From: Mathieu Desnoyers
Date: Friday, August 27, 2010 - 8:43 am

Maybe with the dyn min_vruntime feature proposed in this patchset we should

Maybe we could have a timer delay threshold under which we consider the wakeup
to be "short" or "long". e.g. a timer firing every ms is very likely to need its
wakups to proceed quickly, but if the timer fires every hours, we could not care

What we do here is that we pass it on across wakeups, but not across wakups of
new threads (so it's not passed across forks). However, AFAIU, this only
generates a bias that will select the wakeup target as next buddy IIF it is
within a certain range from the leftmost entity. The disconnect is done (setting
se.interactive to 0) on a per-thread basis when they go to sleep.

I've seen some version of Peter's patches which were decrementing a counter as
the token is passed across a wakeup, but I did not find it necessary so far.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Mathieu Desnoyers
Date: Friday, August 27, 2010 - 11:38 am

I'm curious: which workload was showing this kind of problem exactly ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Mike Galbraith
Date: Saturday, August 28, 2010 - 12:33 am

Dunno.  Messing with spread is a very sharp double edged sword.

I took the patch set out for a spin, and saw some negative effects in
that regard.  I did see some positive effects as well though, x264 for
example really wants round robin, so profits.  Things where preemption
translates to throughput don't care for the idea much.  vmark rather
surprised me, it hated this feature for some reason, I expected the
opposite.

It's an interesting knob, but I wouldn't turn it on by default on
anything but maybe a UP desktop box.

(i kinda like cgroups classified by pgid for desktop interactivity under

Hm, I don't recall exact details.  I was looking at a lot of different
load mixes at the time, mostly interactive and batch, trying to shrink
the too darn high latencies seen with modest load mixes.

I never tried to figure out why next buddy had worse effect on spread
than last buddy (not obvious to me), just noted the fact that it did.  I
recall that it had a large negative effect on x264 throughput as well.

	-Mike

some numbers:

35.3x = DYN_MIN_VRUNTIME

netperf TCP_RR
35.3              97624.82   96982.62   97387.50   avg 97331.646 RR/sec  1.000
35.3x             95044.98   95365.52   94581.92   avg 94997.473 RR/sec   .976

tbench 8
35.3               1200.36    1200.56    1199.29   avg 1200.070 MB/sec   1.000
35.3x              1106.92    1110.50    1106.58   avg 1108.000 MB/sec    .923

x264 8
35.3                407.12     408.80     414.60   avg 410.173 fps       1.000
35.3x               428.07     436.43     438.16   avg 434.220 fps       1.058

vmark
35.3                149678     149269     150584   avg 149843.666 m/sec  1.000
35.3x               120872     120932     121247   avg 121017.000 m/sec   .807

mysql+oltp
                      1          2          4          8         16         32         64        128        256
35.3           10956.33   20747.86   37139.27   36898.70   36575.90   36104.63   34390.26   31574.46   29148.01
               ...
From: Peter Zijlstra
Date: Friday, August 27, 2010 - 3:58 am

It's RFC, features are nice to test things and see if/how they work.
Features are not a user option, so you don't have to worry about them.

Also, 'better' is a very hard thing to quantify. Some people like
throughput, some like latency, and others like raw context switch
performance.

Hopefully we'll soon have a non privileged sporadic task scheduler for
all our srt media needs.
--

From: Indan Zupancic
Date: Friday, August 27, 2010 - 3:47 am

Hello,


Please don't hide scheduler improvements behind obscure CONFIG_SCHED_DEBUG
options. If it doesn't make the scheduler better just don't merge it. If it
does then it should be enabled by default.

Thank you,

Indan


P.S. Tony Luck seems to be chopped of the CC list.


--

Previous thread: [RFC PATCH 04/11] sched: debug cleanup place entity by Mathieu Desnoyers on Thursday, August 26, 2010 - 11:09 am. (1 message)

Next thread: [PATCH] drivers/net/: ll_temac_main.c & ll_temac.h: Adding ethtool interface by Ville Sundell on Thursday, August 26, 2010 - 11:13 am. (2 messages)