I noticed that my network performance has gone down from 2.6.22 from [ 3] 0.0-10.0 sec 113 MBytes 95.0 Mbits/sec to [ 3] 0.0-10.0 sec 75.7 MBytes 63.3 Mbits/sec with 2.6.23-rc1 (and 2.6.23-rc8), as measured with iperf. I did a git bisect today and tracked it back to the commit where CFS was enabled ("sched: cfs core code; apply the CFS core code", commit dd41f596cda0d7d6e4a8b139ffdfabcefdd46528). I also compiled a kernel from git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git but things don't improve. This is on a Thecus N2100, an ARM (Intel IOP32x) based storage device with a r8169 card, SATA disks and 512 MB RAM. My config is attached. What kind of information can I supply so you can track this down? -- Martin Michlmayr http://www.cyrius.com/
as a starter, could you boot the sched-devel.git kernel, with CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y enabled and could you run this script while the iperf test is in the middle of its testrun: http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh this will gather a good deal of info about the workload in question. Please send me the resulting debug file. Ingo -
Another thing: please also do the same with the vanilla v2.6.22 kernel, and send me that file too. (so that the two cases can be compared) Ingo -
I put the log files here: http://www.cyrius.com/tmp/2.6.22 http://www.cyrius.com/tmp/2.6.23-rc8-sched-devel I increased the time ipfer ran to 30 secs since your script runs for over 15 secs. I got: [ 3] 0.0-30.0 sec 331 MBytes 92.6 Mbits/sec 2.6.22 vs [ 3] 0.0-30.0 sec 222 MBytes 62.1 Mbits/sec 2.6.23-rc8-sched-devel -- Martin Michlmayr http://www.cyrius.com/ -
thanks! the test does almost no context switches: procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 2 0 0 462928 3280 37216 0 0 137 75 2306 83 13 38 47 2 2 0 0 462928 3280 37216 0 0 0 0 8600 54 6 94 0 0 2 0 0 462928 3280 37216 0 0 0 36 8667 55 7 93 0 0 2 0 0 462928 3280 37216 0 0 0 0 8592 53 5 95 0 0 2 0 0 462928 3280 37216 0 0 0 0 8638 52 7 93 0 0 (the 'cs' column shows 50-80 context switches per second.) so there must be some other side-effect, not raw scheduling overhead or some other direct scheduler performance problem. Ingo -
FWIW, on my box blasting localhost, I see the opposite, repeatably. root@Homer: iperf -v iperf version 2.0.2 (03 May 2005) pthreads 2.6.22.1-smp for i in `seq 1 3`; do iperf -c localhost; done >> /xx ------------------------------------------------------------ Client connecting to localhost, TCP port 5001 TCP window size: 49.2 KByte (default) ------------------------------------------------------------ [ 3] local 127.0.0.1 port 21383 connected with 127.0.0.1 port 5001 [ 3] 0.0-10.1 sec 518 MBytes 432 Mbits/sec ------------------------------------------------------------ Client connecting to localhost, TCP port 5001 TCP window size: 49.2 KByte (default) ------------------------------------------------------------ [ 3] local 127.0.0.1 port 21384 connected with 127.0.0.1 port 5001 [ 3] 0.0-10.0 sec 325 MBytes 273 Mbits/sec ------------------------------------------------------------ Client connecting to localhost, TCP port 5001 TCP window size: 49.2 KByte (default) ------------------------------------------------------------ [ 3] local 127.0.0.1 port 21385 connected with 127.0.0.1 port 5001 [ 3] 0.0-10.0 sec 434 MBytes 363 Mbits/sec 2.6.23-rc8-smp-d for i in `seq 1 3`; do iperf -c localhost; done >> /xx ------------------------------------------------------------ Client connecting to localhost, TCP port 5001 TCP window size: 49.2 KByte (default) ------------------------------------------------------------ [ 3] local 127.0.0.1 port 11650 connected with 127.0.0.1 port 5001 [ 3] 0.0-10.0 sec 2.01 GBytes 1.72 Gbits/sec ------------------------------------------------------------ Client connecting to localhost, TCP port 5001 TCP window size: 49.2 KByte (default) ------------------------------------------------------------ [ 3] local 127.0.0.1 port 11651 connected with 127.0.0.1 port 5001 [ 3] 0.0-10.0 sec 2.02 GBytes 1.74 Gbits/sec ------------------------------------------------------------ Client connecting to localhost, TCP port ...
I noticed on the iperf website a patch which contains sched_yield(). http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt Do you have that patch applied by any chance? If so, it might be a worth while to try it without it. -Mike -
Yes, this patch was applied. When I revert it, I get the same (high) performance with both kernels. -- Martin Michlmayr http://www.cyrius.com/ -
great! Could you try this too: echo 1 > /proc/sys/kernel/sched_compat_yield does it fix iperf performance too (with the yield patch applied to iperf)? I think the real fix would be for iperf to use blocking network IO though, or maybe to use a POSIX mutex or POSIX semaphores. Ingo -
So it's definitely not a bug in the kernel, only in iperf? (CCing Stephen Hemminger who wrote the iperf patch.) -- Martin Michlmayr http://www.cyrius.com/ -
Martin: Actually, in this case I think iperf is doing the right thing (though not the best thing) and the kernel is doing the wrong thing. It's calling 'sched_yield' to ensure that every other thread gets a chance to run before the current thread runs again. CFS is not doing that, allowing the yielding thread to hog the CPU to the exclusion of the other threads. (It can allow the yielding thread to hog the CPU, of course, just not to the exclusion of other threads.) It's still better to use some kind of rational synchronization primitive (like mutex/sempahore) so that the other threads can tell you when there's something for you to do. It's still better to use blocking network IO, so the kernel will let you know exactly when to try I/O and your dynamic priority can rise. Ingo: Can you clarify what CFS' current default sched_yield implementation is and what setting sched_compat_yield to 1 does? Which way do we get the right semantics (all threads of equal static priority are scheduled, with some possible SMP fuzziness, before this thread is scheduled again)? DS -
it's not doing the right thing at all. I had a quick look at the source code, and the reason for that weird yield usage was that there's a locking bug in iperf's "Reporter thread" abstraction and apparently instead of fixing the bug it was worked around via a horrible yield() based user-space lock. the (small) patch below fixes the iperf locking bug and removes the yield() use. There are numerous immediate benefits of this patch: - iperf uses _much_ less CPU time. On my Core2Duo test system, before the patch it used up 100% CPU time to saturate 1 gigabit of network traffic to another box. With the patch applied it now uses 9% of CPU time. - sys_sched_yield() is removed altogether - i was able to measure much higher bandwidth over localhost for example. This is the case for over-the-network measurements as well. - the results are also more consistent and more deterministic, hence more reliable as a benchmarking tool. (the reason for that is that more CPU time is spent on actually delivering packets, instead of mindlessly polling on the user-space "lock", so we actually max out the CPU, instead of relying on the random proportion the workload was able to make progress versus wasting CPU time on polling.) sched_yield() is almost always the symptom of broken locking or other bug. In that sense CFS does the right thing by exposing such bugs =B-) Ingo -------------------------> Subject: iperf: fix locking From: Ingo Molnar <mingo@elte.hu> fix iperf locking - it was burning CPU time while polling unnecessarily, instead of using the proper wait primitives. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- compat/Thread.c | 3 --- src/Reporter.c | 13 +++++++++---- src/main.cpp | 2 ++ 3 files changed, 11 insertions(+), 7 deletions(-) Index: iperf-2.0.2/compat/Thread.c =================================================================== --- iperf-2.0.2.orig/compat/Thread.c +++ ...
On Wed, 26 Sep 2007 15:31:38 +0200 A similar patch has already been submitted, since BSD wouldn't work without it. -
Here is the combined fixes from iperf-users list. Begin forwarded message: Date: Thu, 30 Aug 2007 15:55:22 -0400 From: "Andrew Gallatin" <gallatin@gmail.com> To: iperf-users@dast.nlanr.net Subject: [PATCH] performance fixes for non-linux Hi, I've attached a patch which gives iperf similar performance to netperf on my FreeBSD, MacOSX and Solaris hosts. It does not seem to negatively impact Linux. I only started looking at the iperf source yesterday, so I don't really expect this to be integrated as is, but a patch is worth a 1000 words :) Background: On both Solaris and FreeBSD, there are 2 things slowing iperf down: The gettimeofday timestamp around each socket read/write is terribly expensive, and the sched_yield() or usleep(0) causes iperf to take 100% of the time (system time on BSD, split user/system time on Solaris and MacOSX), which slows things down and confuses the scheduler. To address the gettimeofday() issue, I treat TCP different than UDP, and TCP tests behave as though only a single (huge) packet was sent. Rather then ending the test based on polling gettimeofday() timestamps, an interval timer / sigalarm handler is used. I had to increase the packetLen from an int to a max_size_t. To address the sched_yield/usleep issue, I put the reporter thread to sleep on a condition variable. For the TCP tests at least, there is no reason to have it running during the test and it is best to just get it out of the way rather than burning CPU in a tight loop. I've also incorporated some fixes from the FreeBSD ports collection: --- include/headers.h use a 64-bit type for max_size_t --- compat/Thread.c oldTID is not declared anywhere. Make this compile (seems needed for at least FreeBSD & MacOSX) --- src/Client.cpp BSDs can return ENOBUFS during a UDP test when the socket buffer fills. Don't exit when this happens. I've run the resulting iperf on FreeBSD, Solaris, MacOSX and Linux, and it seems to work for me. It is nice not to have a 100% ...
...Only if it were under some DEBUG option. Even if iperf is doing the wrong thing there is no explanation for such big difference in the behavior between sched_compat_yield 1 vs. 0. It seems common interfaces should work similarly and predictably on various systems, and here, if I didn't miss something, linux looks like a different kind? Regards, Jarek P. -
note that i qualified my sentence both via "In that sense" and via a smiley! So i was not suggesting that this is a general rule at all and i What you missed is that there is no such thing as "predictable yield behavior" for anything but SCHED_FIFO/RR tasks (for which tasks CFS does keep the behavior). Please read this thread on lkml for a more detailed background: CFS: some bad numbers with Java/database threading [FIXED] http://lkml.org/lkml/2007/9/19/357 http://lkml.org/lkml/2007/9/19/328 in short: the yield implementation was tied to the O(1) scheduler, so the only way to have the exact same behavior would be to have the exact same core scheduler again. If what you said was true we would not be able to change the scheduler, ever. For something as vaguely defined of an API as yield, there's just no way to have a different core scheduler and still behave the same way. So _generally_ i'd agree with you that normally we want to be bug for bug compatible, but in this specific (iperf) case there's just no point in preserving behavior that papers over this _clearly_ broken user-space app/thread locking (for which now two fixes exist already, plus a third fix is the twiddling of that sysctl). Ingo -
Actually, I've analyzed this smiley for some time but these scheduler OK, but let's forget about fixing iperf. Probably I got this wrong, but I've thought this "bad" iperf patch was tested on a few nixes and linux was the most different one. The main point is: even if there is no standard here, it should be a common interest to try to not differ too much at least. So, it's not about exactness, but 50% (63 -> 95) change in linux own 'definition' after upgrading seems to be a lot. So, IMHO, maybe some 'compatibility' test could be prepared to compare a few different ideas on this yield and some average value could be a kind of at least linux' own standard, which should be emulated within some limits by next kernels? Thanks, Jarek P. -
you repeat your point of "emulating yield", and i can only repeat my
point that you should please read this:
http://lkml.org/lkml/2007/9/19/357
because, once you read that, i think you'll agree with me that what you
say is simply not possible in a sane way at this stage. We went through
a number of yield implementations already and each will change behavior
for _some_ category of apps. So right now we offer two implementations,
and the default was chosen empirically to minimize the amount of
complaints. (but it's not possible to eliminate them altogether, for the
reasons outlined above - hence the switch.)
Ingo
-
Sorry, but I think you got me wrong: I didn't mean emulation of any implementation, but probably the some thing you write above: emulation of time/performance. In my opinion this should be done experimentally too, but with something more objective and constant than current "complaints counter". And the first thing could be a try to set some kind of linux internal "standard of yeld" for the future by averaging a few most popular systems in a test doing things like this iperf or preferably more. Jarek P. -
By definition, yield is essentially undefined as to the behaviour between SCHED_OTHER tasks at the same priority level (ie. all of them), because SCHED_OTHER scheduling behaviour itself is undefined. It's never going to do exactly what everybody wants, except those using it for legitimate reasons in realtime applications. -
That's why I've used words like: "not differ too much" and "within some limits" above. So, it's only about being reasonable, compared to our previous versions, and to others, if possible. This should be not impossible to additionally control (delay or accelerate) yielding tasks wrt. current load/weight/number_of_tasks etc., if we know (after testing) eg. average expedition time of such tasks with various schedulers. Of course, such tests and controlling paremeters can change for some time until the problem is explored enough, and still no aim for exactness or to please everybody. BTW, it looks like risky to criticise sched_yield too much: some people can misinterpret such discussions and stop using this at all, even where it's right. Regards, Jarek P. -
Really, i have never seen a _single_ mainstream app where the use of sched_yield() was the right choice. Fortunately, the sched_yield() API is already one of the most rarely used scheduler functionalities, so it does not really matter. [ In my experience a Linux scheduler is stabilizing pretty well when the discussion shifts to yield behavior, because that shows that everything else is pretty much fine ;-) ] But, because you assert it that it's risky to "criticise sched_yield() too much", you sure must know at least one real example where it's right to use it (and cite the line and code where it's used, with specificity)? Ingo -
It can occasionally be an optimization. You may have a case where you can do something very efficiently if a lock is not held, but you cannot afford to wait for the lock to be released. So you check the lock, if it's held, you yield and then check again. If that fails, you do it the less optimal way (for example, dispatching it to a thread that *can* afford to wait). It is also sometimes used in the implementation of spinlock-type primitives. After spinning fails, yielding is tried. I think it's also sometimes appropriate when a thread may monopolize a mutex. For example, consider a rarely-run task that cleans up some expensive structures. It may need to hold locks that are only held during this complex clean up. One example I know of is a defragmenter for a multi-threaded memory allocator, and it has to lock whole pools. When it releases these locks, it calls yield before re-acquiring them to go back to work. The idea is to "go to the back of the line" if any threads are blocking on those mutexes. There are certainly other ways to do these things, but I have seen cases where, IMO, yielding was the best solution. Doing nothing would have been Can you explain what the current sched_yield behavior *is* for CFS and what the tunable does to change it? The desired behavior is for the current thread to not be rescheduled until every thread at the same static priority as this thread has had a chance to be scheduled. Of course, it's not clear exactly what a "chance" is. The semantics with respect to threads at other static priority levels is not clear. Ditto for SMP issues. It's also not clear whether threads that yield should be rewarded or punished for doing so. DS -
These are generic statements, but i'm _really_ interested in the specifics. Real, specific code that i can look at. The typical Linux distro consists of in execess of 500 millions of lines of code, in tens of thousands of apps, so there really must be some good, valid and "right" use of sched_yield() somewhere in there, in some mainstream app, right? (because, as you might have guessed it, in the past decade of sched_yield() existence i _have_ seen my share of sched_yield() utilizing user-space code, and at the moment i'm not really impressed by those examples.) Preferably that example should show that the best quality user-space lock implementation in a given scenario is best done via sched_yield(). Actual code and numbers. (And this isnt _that_ hard. I'm not asking for a full RDBMS implementation that must run through SQL99 spec suite. This is about a simple locking primitive, or a simple pointer to an existing (user-space spinlocks are broken beyond words for anything but perhaps at a quick glance this seems broken too - but if you show the specific code i might be able to point out the breakage in detail. (One underlying problem here appears to be fairness: a quick unlock/lock sequence may starve out other threads. yield wont solve that fundamental problem either, and it will introduce random latencies into apps using sure. (and i described that flag on lkml before) The sched_yield flag does two things: - if 0 ("opportunistic mode"), then the task will reschedule to any other task that is in "bigger need for CPU time" than the currently running task, as indicated by CFS's ->wait_runtime metric. (or as indicated by the similar ->vruntime metric in sched-devel.git) - if 1 ("agressive mode"), then the task will be one-time requeued to the right end of the CFS rbtree. This means that for one instance, all other tasks will run before this task will run again - after that do you realize that this "desired behavior" you ...
Maybe, maybe not. Even if so, it would be very difficult to find. Simply grepping for sched_yield is not going to help because determining whether a User-space spinlocks are broken so spinlocks can only be implemented in kernel-space? Even if you use the kernel to schedule/unschedule the tasks, You are assuming that random latencies are necessarily bad. Random latencies Thank you. Unfortunately, neither of these does what sched_yiled is really supposed to do. Opportunistic mode does too little and agressive mode does I don't have a problem with failing to emulate the old scheduler's behavior if we can show that the new behavior has saner semantics. Unfortunately, in this case, I think CFS' semantics are pretty bad. Neither of these is what sched_yield is supposed to do. Note that I'm not saying this is a particularly big deal. And I'm not calling CFS' behavior a regression, since it's not really better or worse than the old behavior, simply different. I'm not familiar enough with CFS' internals to help much on the implementation, but there may be some simple compromise yield that might work well enough. How about simply acting as if the task used up its timeslice and scheduling the next one? (Possibly with a slight reduction in penalty or reward for not really using all the time, if possible?) DS -
firstly, there's no notion of "timeslices" in CFS. (in CFS tasks "earn" a right to the CPU, and that "right" is not sliced in the traditional sense) But we tried a conceptually similar thing: to schedule not to the end of the tree but into the next position. That too was bad for _some_ apps. CFS literally cycled through 5-6 different yield implementations in its 22 versions so far. The current flag solution was achieved in such an iterative fashion and gives an acceptable solution to all app categories that came up so far. [ and this is driven by compatibility goals - regardless of how broken we consider yield use. The ideal solution is of course to almost never use yield. Fortunately 99%+ of Linux apps follow that ideal solution ;-) ] Ingo -
http://www.koders.com/ (which has been around for a long time) actually seems to have more code. It's also a pity that so much free code is behind passwords and protected from spiders. -Andi -
From kernel/sched_fair.c:
"/*
* Targeted preemption latency for CPU-bound tasks:
* (default: 20ms, units: nanoseconds)
*
* NOTE: this latency value is not the same as the concept of
* 'timeslice length' - timeslices in CFS are of variable length.
* (to see the precise effective timeslice length of your workload,
* run vmstat and monitor the context-switches field)
..."
So, no notion of something, which are(!) of variable length, and which
precise effective timeslice lenght can be seen in nanoseconds? (But
not timeslice!)
Nevertheless, it seems, this 1% is important enough to boast a little:
"( another detail: due to nanosec accounting and timeline sorting,
sched_yield() support is very simple under CFS, and in fact under
CFS sched_yield() behaves much better than under any other
scheduler i have tested so far. )"
[Documentation/sched-design-CFS.txt]
Cheers,
Jarek P.
-
You should really read and understand the code you are arguing about :-/ In the 2.6.22 scheduler, there was a p->time_slice per task variable that could be manipulated. (Note, in 2.6.22's sched_yield() did not manipulate p->time_slice.) sysctl_sched_latency on the other hand is not something that is per task (it is global) so there is no pending timeslice to be "cleared" as it has been suggested naively. Ingo -
Maybe you could help me with better comments? IMHO, it would be enough to warn new timeslices have different meaning, or stop to use this term at all. (Btw, in -rc8-mm2 I see new sched_slice() function which But, there is this "something", very similar and very misleading, you count eg. in check_preempt_curr_fair to find if time is over, and I think this could be similar enough to what David Schwartz wanted to use in his idea, and you didn't care to explain why it's so different? Jarek P. -
i'm curious, what better do you need than the very detailed comment quoted above? Which bit of "this latency value is not the same as the concept of timeslice length" is difficult to understand? The timeslices of tasks (i.e. the time they spend on a CPU without scheduling away) is _not_ maintained directly in CFS as a per-task variable that can be "cleared", it's not the metric that drives scheduling. Yes, of course CFS too "slices up CPU time", but those slices are not the per-task wrong again. That is a function, not a variable to be cleared. (Anyway, the noise/signal ratio is getting increasingly high in this thread with no progress in sight, so i cannot guarantee any further replies - possibly others will pick up the tab and explain/discuss any other questions that might come up. Patches are welcome of course.) Ingo -
It's not about this comment alone, but this comment plus "no notion" I can't see anything about clearing. I think, this was about charging, which should change the key enough, to move a task to, maybe, a better place in a que (tree) than with current ways. Jarek P. PS: Don't you think that a nice argue with some celebrity, like Ingo Molnar himself, is by far more interesting than those dull patches? -
just a quick patch, not tested and I've not evaluated all possible
implications yet.
But someone might give it a try with his/(her -- are even more
welcomed :-) favourite sched_yield() load.
(and white space damaged)
--- sched_fair-old.c 2007-10-03 12:45:17.010306000 +0200
+++ sched_fair.c 2007-10-03 12:44:46.899851000 +0200
@@ -803,7 +803,35 @@ static void yield_task_fair(struct rq *r
update_curr(cfs_rq);
return;
+ } else if (sysctl_sched_compat_yield == 2) {
+ unsigned long ideal_runtime, delta_exec,
+ delta_exec_weighted;
+
+ __update_rq_clock(rq);
+ /*
+ * Update run-time statistics of the 'current'.
+ */
+ update_curr(cfs_rq);
+
+ /*
+ * Emulate (speed up) the effect of us being preempted
+ * by scheduler_tick().
+ */
+ ideal_runtime = sched_slice(cfs_rq, curr);
+ delta_exec = curr->sum_exec_runtime -
curr->prev_sum_exec_runtime;
+
+ if (ideal_runtime > delta_exec) {
+ delta_exec_weighted = ideal_runtime - delta_exec;
+
+ if (unlikely(curr->load.weight != NICE_0_LOAD)) {
+ delta_exec_weighted =
calc_delta_fair(delta_exec_weighted,
+
&se->load);
+ }
+ se->vruntime += delta_exec_weighted;
+ }
+ return;
}
+
/*
* Find the rightmost entry in the rbtree:
--
Best regards,
Dmitry Adamushko
s/curr/se -- Best regards, Dmitry Adamushko -
Thanks very much! Alas, I'll be able to look at this and try only in the evening. Best regards, Jarek P. -
thanks Dmitry. Btw., this is quite similar to the yield_granularity patch i did originally, just less flexible. It turned out that apps want either zero granularity or "infinite" granularity, they dont actually want something inbetween. That's the two extremes that the current sysctl expresses in essence. Ingo -
On Wed, Oct 03, 2007 at 12:55:34PM +0200, Dmitry Adamushko wrote: Of course, after some evaluation by yourself and Ingo the most interesting should be Martin's Michlmayr testing, so I hope you'll Cc him too?! Jarek P. -
My current take on this: queue the current task right to the next position in the tree (this is what this patch achieves in essence) was one of the yield implementations we already tried in CFS but it didnt meet the expectations of some apps. So i can only repeat my argument: this is not something that can be "solved" in the way you imagine and your arguments just reiterate the path that CFS has already taken in the past. So please do not expect _us_ to go out and pester people. If people feel so inclined, they are of course welcome to test out various approaches. (they might as well try the original yield-granularity patch which also makes the amount of "delay" tunable, so the ideal amount of delay can be figured out. And of course they should also try the existing yield flag.) Ingo -
I'm terribly sorry! Of course, the last thing I would like is to pester anybody. I simply wasn't sure you've told about the same idea. And of course, there is no reason to go back to something checked before. Thanks, Jarek P. -
ok - i've re-read it and it indeed is somewhat confusing without additional context. I'll improve the wording. (sched-design-CFS.txt needs an update anyway) Ingo -
It still gives us a target time, so could we not simply have sched_yield put the thread completely to sleep for the given amount of time? It wholly redefines the operation, and its far more expensive (now there's a whole new timer involved) but it might emulate the expected behavior. Its hideous, but so is sched_yield in the first place, so why not? --CJD -
user-space spinlocks (in anything but SCHED_FIFO tasks) are pretty broken because they waste CPU time. (not as broken as yield, because "wasting CPU time" is a more deterministic act, but still broken) Could you cite a single example where user-space spinlocks are technically the best solution? Ingo -
i'm not really assuming anything, i gave a vague first impression of the vague example you gave (assuming that the yield was done to combat fairness problems). This is a case where the human language shows its boundaries: statements that are hard to refute with certainty because they are too vague. So i'd really suggest you show me some sample/real code - that would move this discussion to a much more productive level. but i'll attempt to weave the chain of argument one step forward (in the hope of not distorting your point in any way): _if_ the sched_yield() call in that memory allocator is done because it uses a locking primitive that is unfair (hence the memory pool lock can be starved), then the "guaranteed large latency" is caused by "guaranteed unfairness". The solution is not to insert a random latency (via a sched_yield() call) that also has a side-effect of fairness to other tasks, because this random latency introduces guaranteed unfairness for this particular task. The correct solution IMO is to make the locking primitive more fair _without_ random delays, and there are a number of good techniques for that. (they mostly center around the use of futexes) one thing that is often missed is that most of the cost of a yield() is in the system call and the context-switch - quite similar to the futex slowpath. So there's _no_ reason to not use a futexes on Linux. (yes, there might be historic/compatibility or ease-of-porting arguments but those do not really impact the fundamental argument of whether something is technically right or not.) Ingo -
sched_yield() has been around for a decade (about three times longer than futexes were around), so if it's useful, it sure should have grown some 'crown jewel' app that uses it and shows off its advantages, compared to other locking approaches, right? For example, if you asked me whether pipes are the best thing for certain apps, i could immediately show you tons of examples where they are. Same for sockets. Or RT priorities. Or nice levels. Or futexes. Or just about any other core kernel concept or API. Your notion that showing a good example of an API would be "difficult" because it's hard to determine "smart" use is not tenable i believe and does not adequately refute my pretty plain-meaning "it does not exist" assertion. If then this is one more supporting proof for the fundamental weakness of the sched_yield() API. Rarely are we able to so universally condemn an API: real-life is usually more varied and even for theoretically poorly defined APIs _some_ sort of legitimate use does grow up. APIs that are not in any real, meaningful use, despite a decade of presence are not really interesting to me personally. (especially in this case where we know exactly _why_ the API is used so rarely.) Sure we'll continue to support it in the best possible way, with the usual kernel maintainance policy: without hurting other, more commonly used APIs. That was the principle we followed in previous schedulers too. And if anyone has a patch to make sched_yield() better than it is today, i'm of course interested in it. Ingo -
But sched_yield() on Linux never did what the majority of programmers assumed it would do (give up the CPU to some runnable processes for the rest of the time-slice). Instead, it just appeared to spin in the kernel. Therefore, those who needed a sched_yield(), just used usleep(). Whether or not there is a POSIX definition of sched_yield(), there is a need for something that will give up the CPU and not busy-wait. There are many control applications where state-machines are kept in user-mode code. The code waits for an event. It shouldn't be spinning, wasting CPU time, when the kernel can be doing file and network I/O with the wasted CPU cycles. So, just because sched_yield() doesn't work as expected, is not the reason to get rid of it. Cheers, Dick Johnson Penguin : Linux version 2.6.16.24 on an i686 machine (5592.59 BogoMips). My book : http://www.AbominableFirebug.com/ _ **************************************************************** The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to DeliveryErrors@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them. Thank you. -
These "control applications" would be real-time processes, for which (AIUI) sched_yield() behavior is completely well-defined and implemented as such by Linux. The question here is how useful the call is for SCHED_OTHER (non-real-time) processes, for which it has no well-defined semantics. -Doug -
On Tue, 2007-10-02 at 08:46 +0200, Ingo Molnar wrote: Do you still have intentions to add a directed yield API? I remember seeing it in the earlier CFS patches. - Eric -
That is a *terrible* disgusting way to use yield. Better options: (1) inotify/dnotify (2) create a "foo.lock" file and put the mutex in that It works better than doing the wait-loop from the start? What evidence do you provide to support this assertion? Specifically, in the first case you tell the kernel "I'm waiting for something but I don't know what it is or how long it will take"; while in the second case you tell the kernel "I'm waiting for something that will take exactly X milliseconds, even though I don't know what it is. If you really want something similar to the old behavior then just replace the "sched_yield()" call with a proper sleep for the estimated time We weren't looking for "actual uses", especially not in binary-only apps. What we are looking for is optimal uses of sched_yield(); ones where that is the best alternative. This... certainly isn't. Cheers, Kyle Moffett --
Sure, tie yourself to a Linux-specific mechanism that may or may not work Right, tie yourself to process-shared mutexes which historically weren't available on Linux. That's much better than an option that's been stable for How is that better? There is literally no improvement, since the first check The evidence is that more than half the time, this avoids the sleep. That means it has zero cost, since the yield is no heavier than a sleep would be, The problem is that if the estimate is too short, pre-emption will result in a huge performance drop. If the estimate is too long, there will be some Your standards for "optimal" are totally unrealistic. In his case, it was optimal. Using platform-specific optimizations would have meant more development and test time for minimal benefit. Sleeping first would have had some performance cost and no benefit. In his case, sched_yield was optimal. Really. DS --
On Mon, 1 Oct 2007 09:49:35 -0700 at this point it's "use a futex" instead; once you're doing system calls you might as well use the right one for what you're trying to achieve. -
There are two answers to this. One is that you sometimes are writing POSIX code and Linux-specific optimizations don't change the fact that you still need a portable implementation. The other answer is that futexes don't change anything in this case. In fact, in the last time I hit this, the lock was a futex on Linux. Nevertheless, that doesn't change the basic issue. The lock is locked, you cannot afford to wait for it, but not getting the lock is expensive. The solution is to yield and check the lock again. If it's still held, you dispatch to another thread, but many times, yielding can avoid that. A futex doesn't change the fact that sometimes you can't afford to block on a lock but nevertheless would save significant effort if you were able to acquire it. Odds are the thread that holds it is about to release it anyway. That is, you need something in-between "non-blocking trylock, fail easily" and "blocking lock, do not fail", but you'd rather make forward progress without the lock than actually block/sleep. DS -
On Mon, 1 Oct 2007 15:17:52 -0700 yielding IS blocking. Just with indeterminate fuzzyness added to it.... -
Yielding is sort of blocking, but the difference is that yielding will not idle the CPU while blocking might. Yielding is sometimes preferable to blocking in a case where the thread knows it can make forward progress even if it doesn't get the resource. (As in the examples I explained.) DS -
On Mon, 1 Oct 2007 15:44:09 -0700 not really; SOMEONE will make progress, the one holding the lock. Granted, he can be on some other cpu, but at that point all yielding that's also what trylock is for... as well as spinaphores... (you can argue that futexes should be more intelligent and do spinaphore stuff etc... and I can buy that, lets improve them in the kernel by any means. But userspace yield() isn't the answer. A yield_to() would have been a ton better (which would return immediately if the thing you want to yield to is running already somethere), a blind "yield" isn't, since it doesn't say what you want to yield to. Note: The answer to "what to yield to" isn't "everything that might want to run"; we tried that way back when the 2.6.early scheduler was designed and that turns out to not be what people calling yield expected.. (it made their things even slower than they thought). So they want "yield to" semantics, without telling the kernel what they want to yield to, and complain if the kernel second-guesses wrongly.... not a good api. -
So now I not only have to come up with an example where sched_yield is the best practical choice, I have to come up with one where sched_yield is the best conceivable choice? Didn't we start out by agreeing these are very rare cases? Why are we designing new APIs for them (Arjan) and why do we care about their performance (Ingo)? These are *rare* cases. It is a waste of time to optimize them. In this case, nobody cares about fairness to the service thread. It is a cleanup task that probably runs every few minutes. It could be delayed for minutes and nobody would care. What they do care about is the impact of the service thread on the threads doing real work. You two challenged me to present any legitimate use case for sched_yield. I see now that was not a legitimate challenge and you two were determined to shoot down any response no matter how reasonable on the grounds that there is some way to do it better, no matter how complex, impractical, or unjustified by the real-world problem. I think if a pthread_mutex had a 'yield to others blocking on this mutex' kind of a 'go to the back of the line' option, that would cover the majority of cases where sched_yield is your best choice currently. Unfortunately, POSIX gave us yield. Note that I think we all agree that any program whose performance relies on quirks of sched_yield (such as the examples that have been cited as CFS 'regressions') are broken horribly. None of the cases I am suggesting use sched_yield as anything more than a minor optimization. DS -
On 02-10-2007 17:37, David Schwartz wrote: Probably we'll start to care after first comparison tests done by our rivals. It should be a piece of cake for them to find the "right" code... Regards, Jarek P. -
How about: Check the lock. If it is held, sleep for an interval that is shorter than acceptable waiting time. If it is still held, sleep for twice as long. Loop until you get the lock and do the work, or until you you reach the limit for how much you can wait at this point and dispatch to a thread instead. This approach should be portable, don't wake up too often, and don't waste the CPU. (And it won't go idle either, whoever holds the lock will be running.) Helge Hafting -
This used to be true, and still is if you want to be portable. But the point of futexes was precisely to attack this use case: whereas sched_yield() says "I'm waiting for something, but I won't tell you what" the futex ops tells the kernel what you're waiting for. While the time to do a futex op is slightly slower than sched_yield(), futexes win in so many cases that we haven't found a benchmark where yield wins. Yield-lose cases include: 1) There are other unrelated process that yield() ends up queueing behind. 2) The process you're waiting for doesn't conveniently sleep as soon as it releases the lock, so you wait for longer than intended, 3) You race between the yield and the lock being dropped. In summary: spin N times & futex seems optimal. The value of N depends on the number of CPUs in the machine and other factors, but N=1 has shown itself pretty reasonable. Hope that helps, Rusty. -
It's fine to criticise sched_yield(). I agree that new apps should generally be written to use proper completion mechanisms or to wait for specific events. However, there are closed-source and/or frozen-source apps where it's not practical to rewrite or rebuild the app. Does it make sense to break the behaviour of all of these? Chris -
See the background and answers to that in: http://lkml.org/lkml/2007/9/19/357 http://lkml.org/lkml/2007/9/19/328 there's plenty of recourse possible to all possible kinds of apps. Tune the sysctl flag in one direction or another, depending on which behavior the app is expecting. Ingo -
Yeah, I read those threads. It seems like the fundamental source of the disconnect is that the tasks used to be sorted by priority (thus making it easy to bump a yielding task to the end of that priority level) while now they're organized by time (making it harder to do anything priority-based). Do I have that right? Chris -
not really - the old yield implementation in essence gave the task a time hit too, because we rotated through tasks based on timeslices. But the old one requeued yield-ing tasks to the 'active array', and the decision whether a task is in the active or in the expired array was a totally stohastic, load-dependent thing. As a result, certain tasks, under certain workloads saw a "stronger" yield, other tasks saw a "weaker" yield. (The reason for that implementation was simple: yield was (and is) unimportant and it was implemented in the most straightforward way that caused no overhead anywhere else in the scheduler.) ( and to keep perspective it's also important to correct the subject line here: it's not about "network slowdown" - nothing in networking slowed down in any way - it was that iperf used yield in a horrible way. I changed the subject line to reflect that. ) Ingo -
Very clever move! And I see some people have catched this... Since sched_yeld() is a very general purpose tool, it can be easily replaced by others, of course, just like probably half of all system calls. And such things are done often in a code during optimization. But, IMHO, the main value of sched_yield() is it's easy to use and very readable way to mark some place. Sometimes even test if such idea is reasonable at all... Otherwise, many such possibilities could easily stay forgotten forever. But you are right, the value of this call shouldn't be exaggerated, and my proposal was an overkill. Anyway, it seems, there could be imagined something better than current ways of doing this. They look like two extremes and something in between and not too complicated should suffice. Currently, I wonder if simply charging (with a key recalculated) such a task for all the time it could've used isn't one of such methods. It seems, it's functionally analogous with going to the end of que of tasks with the same priority according to the old sched. Regards, Jarek P. -
On Tue, Oct 02, 2007 at 11:03:46AM +0200, Jarek Poplawski wrote: Only now I've read I repeat the idea of David Schwartz (and probably not only him) from a nearby thread, sorry. But, I still try to find what was wrong with it? Jarek P. -
On Mon, Oct 01, 2007 at 10:43:56AM +0200, Jarek Poplawski wrote: No new theory - it's only my reverse Polish translation. Should be: "etc., if we know (after testing) eg. average dispatch time of such". Sorry, Jarek P. -
Martin, could you check the iperf patch below instead of the yield patch
- does it solve the iperf performance problem equally well, and does CPU
utilization drop for you too?
Ingo
-------------------------->
Subject: iperf: fix locking
From: Ingo Molnar <mingo@elte.hu>
fix iperf locking - it was burning CPU time while polling
unnecessarily, instead of using the proper wait primitives.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
compat/Thread.c | 3 ---
src/Reporter.c | 13 +++++++++----
src/main.cpp | 2 ++
3 files changed, 11 insertions(+), 7 deletions(-)
Index: iperf-2.0.2/compat/Thread.c
===================================================================
--- iperf-2.0.2.orig/compat/Thread.c
+++ iperf-2.0.2/compat/Thread.c
@@ -405,9 +405,6 @@ int thread_numuserthreads( void ) {
void thread_rest ( void ) {
#if defined( HAVE_THREAD )
#if defined( HAVE_POSIX_THREAD )
- // TODO add checks for sched_yield or pthread_yield and call that
- // if available
- usleep( 0 );
#else // Win32
SwitchToThread( );
#endif
Index: iperf-2.0.2/src/Reporter.c
===================================================================
--- iperf-2.0.2.orig/src/Reporter.c
+++ iperf-2.0.2/src/Reporter.c
@@ -111,6 +111,7 @@ report_statistics multiple_reports[kRepo
char buffer[64]; // Buffer for printing
ReportHeader *ReportRoot = NULL;
extern Condition ReportCond;
+extern Condition ReportDoneCond;
int reporter_process_report ( ReportHeader *report );
void process_report ( ReportHeader *report );
int reporter_handle_packet( ReportHeader *report );
@@ -338,7 +339,7 @@ void ReportPacket( ReportHeader* agent,
// item
while ( index == 0 ) {
Condition_Signal( &ReportCond );
- thread_rest();
+ Condition_Wait( &ReportDoneCond );
index = agent->reporterindex;
}
agent->agentindex = 0;
@@ -346,7 +347,7 @@ void ReportPacket( ...Yes, it works and CPU goes down too. -- Martin Michlmayr http://www.cyrius.com/ -
i'm curious by how much does CPU go down, and what's the output of iperf? (does it saturate full 100mbit network bandwidth) Ingo -
I get about 94-95 Mbits/sec and CPU drops from 99% to about 82% (this is with a 600 MHz ARM CPU). -- Martin Michlmayr http://www.cyrius.com/ -
