So we have the following components to this patch:
- dynamic min_vruntime -- push the min_vruntime ahead at the
rate of the runqueue wide virtual clock. This approximates
the virtual clock, esp. when turning off sleeper fairness.
And is cheaper than actually computing the virtual clock.
It allows for better insertion and re-weighting behaviour,
but it does increase overhead somewhat.
- special wakeups using the next-buddy to get scheduled 'soon',
used by all wakeups from the input system and timers.
- special fork semantics related to those special wakeups.
So while I would love to simply compute the virtual clock, it would add
a s64 mult to every enqueue/dequeue and a s64 div to each
enqueue/re-weight, which might be somewhat prohibitive, the dyn
min_vruntime approximation seems to work well enough and costs a u32 div
per enqueue.
Adding a preference to all user generated wakeups (input) and
propagating that state along the wakeup chain seems to make sense,
adding the same to all timers is something that needs to be discussed, I
can well imagine not all timers are equally important -- do we want to
extend the timer interface?
If we do decide we want both, we should at the very least merge the
try_to_wake_up() conditional blob (they're really identical). Preferably
we should reduce ttwu(), not add more to it...
Fudging fork seems dubious at best, it seems generated by the use of
timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
broken thing to do, it has very ill defined semantics and is utterly
unable to properly cope with error cases. Furthermore its trivial to
actually correctly implement the desired behaviour, so I'm really
skeptical on this front; friends don't let friends use SIGEV_THREAD.
--
SIGEV_THREAD is the best proof that the whole posix timer interface was comitte[e]d under the influence of not to be revealed mind-altering substances. I completely object to add timer specific wakeup magic and support for braindead fork orgies to the kernel proper. All that mess can be fixed in user space by using sensible functionality. Providing support for misdesigned crap just for POSIX compliance reasons and to make some of the blind abusers of that very same crap happy would be a completely stupid decision. In fact that would make a brilliant precedence case for forcing the kernel to solve user space madness at the expense of kernel complexity. If we follow down that road we get requests for extra functionality for AIO, networking and whatever in a split second with no real good reason to reject them anymore. Thanks, tglx --
I really risked eye cancer and digged into the glibc code. /* There is not much we can do if the allocation fails. */ (void) pthread_create (&th, &tk->attr, timer_sigev_thread, td); So if the helper thread which gets the signal fails to create the thread then everything is toast. What about fixing the f*cked up glibc implementation in the first place instead of fiddling in the kernel to support this utter madness? WTF can't the damned delivery thread not be created when timer_create is called and the signal be delivered to that very thread directly via SIGEV_THREAD_ID ? Thanks, tglx --
Yeah, that sounds exactly like what I proposed about an hour ago on IRC ;) I'm pretty sure that would work. The only thing we might have to be careful about is what happens if the timer re-fires before the thread completes its execution. We might want to let the signal handler detect these overruns somehow. Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Hrm, thinking about it a little more, one of the "plus" sides of these SIGEV_THREAD timers is that a single timer can fork threads that will run on many cores on a multi-core system. If we go for preallocation of a single thread, we lose that. Maybe we could think of a way to preallocate a thread pool instead ? Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Why try and fix a broken thing, just let the app spawn threads and use SIGEV_THREAD_ID itself, it knows its requirements and can do what suits the situation best. No need to try and patch up braindead posix stuff. --
As I dig through the code and learn more about the posix standard, my understanding is that the _implementation_ is broken. The standard just leaves enough rope for the implementation to either do the right thing or to hang itself. Sadly glibc seems to have chosen the second option. Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Why should a single timer fork many threads? Just because a previous thread did not complete before the timer fires again? That's braindamage as all threads call the same function which then needs to be serialized anyway. We really do not need a function which creates tons of threads which get all stuck on the same resource. Thanks, tglx --
It could make sense if the workload is mostly CPU-bound and there is only a very short critical section shared between the threads. But I agree that in many cases this will generate an utter contention mess. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Simply don't use SIGEV_THREAD and spawn you own thread and use SIGEV_THREAD_ID yourself, the programmer knows the semantics and knows if he cares about overlapping timers etc. --
From man timer_create:
SIGEV_THREAD
Upon timer expiration, invoke sigev_notify_function as if it
were the start function of a new thread. (Among the implementa‐
tion possibilities here are that each timer notification could
result in the creation of a new thread, or that a single thread
is created to receive all notifications.) The function is
invoked with sigev_value as its sole argument. If
sigev_notify_attributes is not NULL, it should point to a
pthread_attr_t structure that defines attributes for the new
thread (see pthread_attr_init(3).
So basically, it's the glibc implementation that is broken, not the standard.
The programmer should expect that thread execution can overlap though.
Thanks,
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--
The standard is broken too, what context will the new thread inherit? The pthread_attr_t stuff tries to cover some of that, but pthread_attr_t doesn't cover all inherited task attributes, and allows for some very 'interesting' bugs [1]. The specification also doesn't cover the case where the handler takes more time to execute than the timer interval. [1] - consider the case where pthread_attr_t includes the stack and we use a spawn thread on expire policy and then run into the situation where the handler is delayed past the next expiration. --
Besides pthread_attr_t, thinking of the scheduler/cgroups/etc stuff, I'd think it might be expected to inherit the state of the thread which calls timer_create(). But this is not what glibc does right now, and it is not spelled Why should it ? It seems valid for a workload to result in spawning many threads bound to more than a single core on a multi-core system. So concurrency Setting a thread stack and generating the signal more than once is taken into account in the standard. It leads to unspecified result (IOW: don't do this): http://www.opengroup.org/onlinepubs/009695399/functions/timer_create.html "If evp->sigev_sigev_notify is SIGEV_THREAD and sev->sigev_notify_attributes is not NULL, if the attribute pointed to by sev->sigev_notify_attributes has a thread stack address specified by a call to pthread_attr_setstack() or pthread_attr_setstackaddr(), the results are unspecified if the signal is generated more than once." Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
The problem with allowing concurrency is that the moment you want to do that you get the spawner context and error propagation problems. Thus we're limited to spawning a single thread at timer_create() and handling expirations in there. At that point you have to specify what happens to an expiration while the handler is running, will it queue handlers or will you have to carefully craft your handler using timer_getoverrun(). But I really don't understand why POSIX would provide this composite functionality instead of providing the separate bits to implement this, which I think is only SIGEV_THREAD_ID which is missing. --
These are problems you get only when you allow spawning any number of threads. If, instead, you create a thread pool at timer creation, then you can allow With my statement above in mind, it's true that we'd have to handle what happens when the thread pool is exhausted. But couldn't we simply increment an overrun counter from userland ? (It might be a different counter than kernel space kept Higher abstraction API vs low-level control. A long running battle. ;) Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
That would be a massive resource waste for the normal case where the interval > handler runtime. --
Not if the application can configure this value. So the "normal" case could be set to 1 single thread. Only resource-intensive apps would set this differently, e.g. to the detected number of cpus. Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
.. but as we discussed in private, this would involve adding extra attribute fields to what the standard proposes. I think the problem comes from the fact that POSIX considers the pthread attributes to contain every possible thread attribute one can imagine, which does not take into account the internal kernel state set by other interfaces. So this leaves us with a single-threaded SIGEV_THREAD, which is pretty much useless. But at least it is not totally impossible for glibc to get it right by moving to an implementation which uses only one thread. So until one finds the time to work on fixing glibc SIGEV_THREAD timers, I would strongly recommend against using it. Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
C'mon, Thomas!!! That is entirely too sensible!!! ;-) But if you are going to create the thread at timer_create() time, why not just have the new thread block for the desired duration? Thanx, Paul --
The timer infrastructure allows things like periodic timer which restarts when it fires, detection of missed timer events, etc. If you try doing this in a userland thread context with a simple sleep, then your period becomes however long you sleep for _and_ the thread execution time. This is all in all quite different from the timer semantic. Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Hmmm... Why couldn't the thread in question set the next sleep time based on the timer period? Yes, if the function ran for longer than the period, there would be a delay, but the POSIX semantics allow such a delay, right? Thanx, Paul --
I'm afraid you'll have a large error accumulation over time, and getting the precise picture of how much time between now and where the the period end is expected to be is kind of hard to do precisely from user-space. In a few words, this solution would be terrible for jitter. This is why we usually rely on timers rather than delays in these periodic workloads. Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Why couldn't the timer_create() call record the start time, and then compute the sleeps from that time? So if timer_create() executed at time t=100 and the period is 5, upon awakening and completing the first invocation of the function in question, the thread does a sleep calculated to wake at t=110. Thanx, Paul --
Let's focus on the userspace thread execution, right between the samping of the current time and the call to sleep: Thread A current_time = read current time(); sleep(period_end - current_time); If the thread is preempted between these two operations, then we end up sleeping for longer than what is needed. This kind of imprecision will add up over time, so that after e.g. one day, instead of having the expected number of timer executions, we'll have less than that. This kind of accumulated drift is an unwanted side-effect of using delays in lieue of real periodic timers. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Nonsense, that's why we provide clock_nanosleep(ABSTIME) Thanks, tglx --
If we're using CLOCK_MONOTONIC, you're right, this could work. I was only thinking of relative delays. So do you think Paul's ideal would be a good candidate for the timer_create SIGEV_THREAD glibc implementation then ? Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Simply do not _ever_ use SIGEV_THREAD, there really is no reason to. --
Yep, it's cheap enough and seems to work very well as far as my testing have Yes, this is what lets us kill FAIR_SLEEPERS (and thus let the dynamic I just thought it made sense that when a timer fires and wakes up a thread, there are pretty good chances that we might to wakeup this thread quickly. But it brings the question in a more general sense: would we want this kind of behavior also available for network packets, disk I/O, etc ? IOW, would it make sense to have next-buddy selection on all these input-triggered wakeups ? Since we're only using the next buddy selection, it will only perform this selection if the selected buddy is within a specific vruntime range from the minimum, so AFAIK, I don't think we would end up starving the system in any possible way. So far I cannot see a situation where selecting the next buddy would _not_ make sense in any kind of input-driven wakeups (interactive, timer, disk, network, Sure. Or maybe find out we want to target even more input paths... so far interactive input seemed absolutely needed, timers seemed logical and self-rate-limited. For the others, I don't know. It might actually make sense Agreed for the timer-tied case. I think the direction Thomas proposes for fixing glibc makes much more sense. However, I am wondering about the interactive case too: e.g., if you click on "open terminal", it needs to fork a process and you would normally expect this to come up more quickly than the background kernel compile you're doing. So there might be some interest for this fork vruntime boost for interactivity-driven wakeups. Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
The risk is that you end up with always using next-buddy, and we tried that a while back and that didn't work well for some, Mike might remember. Also, when you use timers things like time-outs you really couldn't care less if its handled sooner rather than later. Disk is usually so slow you really don't want to consider it interactive, but then sometimes you might,.. its a really hard problem. The only clear situation is the direct input, that's a direct link between the user and our wakeup chain and the user is always important. --
I turned it off because it was ripping spread apart badly, and last Yeah, directly linked wakeups using next could be a good thing, but the trouble with using any linkage to the user is that you have to pass it on to reap benefit.. so when do you disconnect? -Mike --
Maybe with the dyn min_vruntime feature proposed in this patchset we should Maybe we could have a timer delay threshold under which we consider the wakeup to be "short" or "long". e.g. a timer firing every ms is very likely to need its wakups to proceed quickly, but if the timer fires every hours, we could not care What we do here is that we pass it on across wakeups, but not across wakups of new threads (so it's not passed across forks). However, AFAIU, this only generates a bias that will select the wakeup target as next buddy IIF it is within a certain range from the leftmost entity. The disconnect is done (setting se.interactive to 0) on a per-thread basis when they go to sleep. I've seen some version of Peter's patches which were decrementing a counter as the token is passed across a wakeup, but I did not find it necessary so far. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
I'm curious: which workload was showing this kind of problem exactly ? Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Dunno. Messing with spread is a very sharp double edged sword.
I took the patch set out for a spin, and saw some negative effects in
that regard. I did see some positive effects as well though, x264 for
example really wants round robin, so profits. Things where preemption
translates to throughput don't care for the idea much. vmark rather
surprised me, it hated this feature for some reason, I expected the
opposite.
It's an interesting knob, but I wouldn't turn it on by default on
anything but maybe a UP desktop box.
(i kinda like cgroups classified by pgid for desktop interactivity under
Hm, I don't recall exact details. I was looking at a lot of different
load mixes at the time, mostly interactive and batch, trying to shrink
the too darn high latencies seen with modest load mixes.
I never tried to figure out why next buddy had worse effect on spread
than last buddy (not obvious to me), just noted the fact that it did. I
recall that it had a large negative effect on x264 throughput as well.
-Mike
some numbers:
35.3x = DYN_MIN_VRUNTIME
netperf TCP_RR
35.3 97624.82 96982.62 97387.50 avg 97331.646 RR/sec 1.000
35.3x 95044.98 95365.52 94581.92 avg 94997.473 RR/sec .976
tbench 8
35.3 1200.36 1200.56 1199.29 avg 1200.070 MB/sec 1.000
35.3x 1106.92 1110.50 1106.58 avg 1108.000 MB/sec .923
x264 8
35.3 407.12 408.80 414.60 avg 410.173 fps 1.000
35.3x 428.07 436.43 438.16 avg 434.220 fps 1.058
vmark
35.3 149678 149269 150584 avg 149843.666 m/sec 1.000
35.3x 120872 120932 121247 avg 121017.000 m/sec .807
mysql+oltp
1 2 4 8 16 32 64 128 256
35.3 10956.33 20747.86 37139.27 36898.70 36575.90 36104.63 34390.26 31574.46 29148.01
...It's RFC, features are nice to test things and see if/how they work. Features are not a user option, so you don't have to worry about them. Also, 'better' is a very hard thing to quantify. Some people like throughput, some like latency, and others like raw context switch performance. Hopefully we'll soon have a non privileged sporadic task scheduler for all our srt media needs. --
Hello, Please don't hide scheduler improvements behind obscure CONFIG_SCHED_DEBUG options. If it doesn't make the scheduler better just don't merge it. If it does then it should be enabled by default. Thank you, Indan P.S. Tony Luck seems to be chopped of the CC list. --
