-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 Hi list, I was working on some unit tests and thought I'd give CFS a whirl to see if it had any impact on my workloads (to see what the fuss was about), and I came up with some pretty disturbing numbers: http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThr... As above but also showing the load average: http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThr... Looks like a regression to me... Basically, all the previous kernels are pretty close (2.6.16 through to 2.6.20 performed almost identically to 2.6.22 and are not shown here to avoid cluttering the graphs) All the 2.6.23-rc kernels performed poorly (except -rc3!): much more erratically and with a sharp performance drop above 800 threads. The load starts to go up and the performance takes a nosedive. With fewer threads (less than 50) there is hardly any difference at all between all the kernels. Notes about the tests and setup: * environment is: Dual Opteron 252 with 3GB ram, scsi disk, etc.. Sun Java 1.6 MySQL 5.0.44 Junit + ant + my test code (devloop.org.uk) * java threads are created first and the data is prepared, then all the threads are started in a tight loop. Each thread runs multiple queries with a 10ms pause (to allow the other threads to get scheduled) * load average is divided by the number of cpus (2) * more general information (which also covers some irrelevant information about some other tests I have published) is here: http://devloop.org.uk/documentation/database-performance/Setup/ Don't shoot the messenger! I can run some more tests if needed (bearing in mind that a full test run takes a few hours) or you can run the tests yourself: instructions on running the tests are included. I am now re-testing with sched-cfs-v2.6.23-rc6-v21-combo-2.patch but feel free to send some other patches. Antoine -----BEGIN PGP SIGNATURE----- Versio...
hm, could you try the patch below ontop of 2.6.23-rc6 and do: echo 1 > /proc/sys/kernel/sched_yield_bug_workaround does this improve the numbers? Ingo --------------> Subject: sched: yield debugging From: Ingo Molnar <mingo@elte.hu> introduce various sched_yield implementations: # default one: echo 0 > /proc/sys/kernel/sched_yield_bug_workaround # always queues the current task next to the next task: echo 1 > /proc/sys/kernel/sched_yield_bug_workaround # NOP: echo 2 > /proc/sys/kernel/sched_yield_bug_workaround tunability depends on CONFIG_SCHED_DEBUG=y. Not-yet-signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 2 + kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++----- kernel/sysctl.c | 19 +++++++++++++ 3 files changed, 88 insertions(+), 6 deletions(-) Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -1392,10 +1392,12 @@ extern void sched_idle_next(void); #ifdef CONFIG_SCHED_DEBUG extern unsigned int sysctl_sched_latency; extern unsigned int sysctl_sched_min_granularity; +extern unsigned int sysctl_sched_yield_granularity; extern unsigned int sysctl_sched_wakeup_granularity; extern unsigned int sysctl_sched_batch_wakeup_granularity; extern unsigned int sysctl_sched_stat_granularity; extern unsigned int sysctl_sched_runtime_limit; +extern unsigned int sysctl_sched_yield_bug_workaround; extern unsigned int sysctl_sched_child_runs_first; extern unsigned int sysctl_sched_features; #endif Index: linux/kernel/sched_fair.c =================================================================== --- linux.orig/kernel/sched_fair.c +++ linux/kernel/sched_fair.c @@ -42,6 +42,16 @@ const_debug unsigned int sysctl_sched_la */ const_debug unsigned int sysctl_sched_child_runs_first = 1; +const_debug unsigned int sysctl_sched_yiel...
the patch i sent was against CFS-devel. Could you try the one below, which is against vanilla -rc6, does it improve the numbers? (it should have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the sysctl. Ingo ------------------------> Subject: sched: yield debugging From: Ingo Molnar <mingo@elte.hu> introduce various sched_yield implementations: # default one: echo 0 > /proc/sys/kernel/sched_yield_bug_workaround # always queues the current task next to the next task: echo 1 > /proc/sys/kernel/sched_yield_bug_workaround # NOP: echo 2 > /proc/sys/kernel/sched_yield_bug_workaround tunability depends on CONFIG_SCHED_DEBUG=y. Not-yet-signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 2 + kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++----- kernel/sysctl.c | 19 +++++++++++++ 3 files changed, 88 insertions(+), 6 deletions(-) Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -1402,10 +1402,12 @@ extern void sched_idle_next(void); extern unsigned int sysctl_sched_latency; extern unsigned int sysctl_sched_min_granularity; +extern unsigned int sysctl_sched_yield_granularity; extern unsigned int sysctl_sched_wakeup_granularity; extern unsigned int sysctl_sched_batch_wakeup_granularity; extern unsigned int sysctl_sched_stat_granularity; extern unsigned int sysctl_sched_runtime_limit; +extern unsigned int sysctl_sched_yield_bug_workaround; extern unsigned int sysctl_sched_child_runs_first; extern unsigned int sysctl_sched_features; Index: linux/kernel/sched_fair.c =================================================================== --- linux.orig/kernel/sched_fair.c +++ linux/kernel/sched_fair.c @@ -42,6 +42,16 @@ unsigned int sysctl_sched_latency __read */ unsigned int sysctl_sched_min_granularity __read_mostly = 2000...
Hi Antoine, Ingo, Hmm, I know diddly about Java, and I don't want to preempt Antoine's next test, but I noticed that he uses Thread.sleep() in his testcode and not the Thread.yield() so it would be interesting if Antoine can test with this patch This is an interesting data point, IMHO ... considering these tests are long, I suspect you ran them only once each per kernel. So I wonder how reliable that -rc3 testpoint is. If this oddity is reproducible, it would be great if you could git-bisect: 1. between 23-rc1 and 23-rc3, and find out which commit led to the improvement in performance, and, 2. between 23-rc3 and 23-rc6, and find out which commit brought down the numbers again. [ http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html, Don't know much about CFS either, but does that constant "10 ms" sleep somehow lead to evil synchronization issues between the test threads? Thanks, Satyam -
[ Argh, just noticed this thread got broke and had been living a parallel life due to Subject: changes, dropped Cc:'s, and munged In-Reply-To:'s. Ok, it appears Antoine tested with the patch and got: http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThr... http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThr... which leads me to believe this probably wasn't a yield problem after all, though it would still be useful if someone with more knowledge of Java could give that code a look over ... Curiously, the -rc3 oddity is still plainly visible there -- how do we explain that? Ingo, does that oddity (and the commits that went in around -rc3 time) give some Ok, it's reproducible, making our job easier. Which also means, Antoine, I don't have access to any real/meaningful SMP systems, so I wonder how much sense it makes in practical terms for me to try and run these tests locally on my little boxen ... would be helpful if someone with 4/8 CPU systems could give Umm, you mention _changing_ this value earlier, but it still remains the same for every thread during every loop for a given test run -- what I'm suggesting is making that code do something like: Thread.sleep(random(x, y)); where x=2, y=20 and random(x, y) returns a random integer between x and y, so all threads sleep for different durations in every loop, but still average out to about ~10 ms over a period. Try to vary x and y (to vary the average) and post the resulting graphs too? CONFIG_HZ (actually, full .config) and dmesg might be useful for us as well. Also, like David mentioned, counting the _number_ of times the test threads managed to execute those SQL queries is probably a better benchmark than measuring the time it takes for threads to finish execution itself -- uniformity in that number across threads would bring out how "fair" CFS is compared to previous kernels, for ...
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 It should still work... if you're patient. http://devloop.org.uk/documentation/database-performance/dmesg-dualopteron.gz Good idea, I've added code to capture more fairness oriented attributes: * time until the first thread finishes - this should actually be higher when the scheduler is fairer. * total number of loop iterations executed when the first thread finishes (higher is better) I think he does have a valid point: maybe, the locality of the java threads is important, but there are so many of them and so many layers below it (jdbc pool, backend database threads, filesystem, device) that the various caches get thrashed quite often anyway, no matter what the interleaving factor is. Which means that being fairer in this case also means switching threads more often and hitting the caches' capacity limits earlier. In which case the granularity and CONFIG_HZ should have a noticeable impact. Ingo did suggest tweaking /proc/sys/kernel/sched_yield_granularity_ns, which I did but this particular test timed out when It ran over the weekend... will run it again now. (results in a couple of days) Maybe the granularity should auto-adjust to prevent such cases? (probably not) Granularity (and therefore fairness) become less important as the number of threads becomes very large, as fairness starts to adversely affects throughput? I have changed my mind and now think that this is not a regression, or at least not one we should worry too much about. As David said, this workload is "pathological". And CFS, does show some noticeable improvements with the new measurements (now using a shorter thread yield of 5ms and a higher number of loop iterations per thread: 25): It does not adversely affect throughput as much (will test with more threads later): http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThr... The number of threads that exit before we reach half the total number ...
[snip] Ick, threading breaks in Gmail with Subject: changes, so I missed the latest updates on this thread ... oh well, never mind. Satyam -
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 See below... It looks good now! Updated results here: http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThr... http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThr... Compared with more kernels here - a bit more cluttered: http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThr... Thanks Ingo! Does this mean that I'll have to keep doing: echo 1 > /proc/sys/kernel/sched_yield_bug_workaround Or are you planning on finding a more elegant solution? # find /proc -name "*workaround*" /proc/sys/kernel/sched_yield_bug_workaround Yeah, I thought that was quite suspicious. - -rc2 is just like -rc1 (see above) so I'll re-test -rc3 first, git-bisect could take a while with those tests... just wiping the disk I've tested before with varying timings, but I had not thought of using a randomized delay. Will add that too. Many thanks to you all for the feedback! Antoine -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFG6qfkGK2zHPGK1rsRCgeEAJ9HUUtHUNScvTVKo5z2sSmo+G+BVgCfdYmK rcd1VYUuzQA2oFEmakjZxgM= =jmI8 -----END PGP SIGNATURE----- -
just to make sure - can you get it to work fast with the -rc6+yield-patch solution too? (i.e. not CFS-devel) We need a (tested) solution for 2.6.23 and the CFS-devel patches are not for 2.6.23. I've attached below the latest version of the -rc6 yield patch - the switch is not dependent on SCHED_DEBUG anymore but always available. Ingo -------------------> Subject: sched: yield workaround From: Ingo Molnar <mingo@elte.hu> sched_yield() is fundamentally broken, and CFS has changed its behavior. Some apps that mistakenly rely on sched_yield() want "weak" sched yield (such as desktop apps) - while some apps want "strong" sched_yield() (for example some JDKs). There's no way for the scheduler to figure out which of the two variants the app really wants - because sched_yield() is all about hiding from the kernel the true structure of the user-space locking code. As a solution, provide a workaround, to introduce a more agressive sched_yield implementation: # default one: echo 0 > /proc/sys/kernel/sched_yield_bug_workaround # always queues the current task next to the next task: echo 1 > /proc/sys/kernel/sched_yield_bug_workaround # NOP: echo 2 > /proc/sys/kernel/sched_yield_bug_workaround in the future, the use of this sysctl might generate a deprecation warning, so that apps start moving away from their reliance on sched_yield(). Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 2 + kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++----- kernel/sysctl.c | 19 +++++++++++++ 3 files changed, 88 insertions(+), 6 deletions(-) Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -1402,10 +1402,12 @@ extern void sched_idle_next(void); extern unsigned int sysctl_sched_latency; extern unsigned int sysctl_sched_min_granularity; +extern unsigned ...
Is this going to be merged? And will you be making the default == 1 or just leaving it at 0, which forces people who want the older behavior to modify the default? -
not at the moment - Antoine suggested that the workload is probably fine and the patch against -rc6 would have no clear effect anyway so we have nothing to merge right now. (Note that there's no "older behavior" possible, unless we want to emulate all of the O(1) scheduler's behavior.) But ... we could still merge something like that patch, but a clearer testcase is needed. The JVM's i have access to work fine. Ingo -
I just got a bug report today: https://bugzilla.redhat.com/show_bug.cgi?id=295071 ================================================== Description of problem: The CFS scheduler does not seem to implement sched_yield correctly. If one program loops with a sched_yield and another program prints out timing information in a loop. You will see that if both are taskset to the same core that the timing stats will be twice as long as when they are on different cores. This problem was not in 2.6.21-1.3194 but showed up in 2.6.22.4-65 and continues in the newest released kernel 2.6.22.5-76. Version-Release number of selected component (if applicable): 2.6.22.4-65 through 2.6.22.5-76 How reproducible: Very Steps to Reproduce: compile task1 int main() { while (1) { sched_yield(); } return 0; } and compile task2 #include <stdio.h> #include <sys/time.h> int main() { while (1) { int i; struct timeval t0,t1; double usec; gettimeofday(&t0, 0); for (i = 0; i < 100000000; ++i) ; gettimeofday(&t1, 0); usec = (t1.tv_sec * 1e6 + t1.tv_usec) - (t0.tv_sec * 1e6 + t0.tv_usec); printf ("%8.0f\n", usec); } return 0; } Then run: "taskset -c 0 ./task1" "taskset -c 0 ./task2" You will see that both tasks use 50% of the CPU. Then kill task2 and run: "taskset -c 1 ./task2" Now task2 will run twice as fast verifying that it is not some anomaly with the way top calculates CPU usage with sched_yield. Actual results: Tasks with sched_yield do not yield like they are suppose to. Expected results: The sched_yield task's CPU usage should go to near 0% when another task is on the same CPU. -
sched_yield is a very poorly defined interface as it does not tell the kernel anything about what kind of locking user-space does. When an app calls the syscall it basically tells the scheduler: "uhm, ok, i'm blocked and i want you to do something else now. I dont want to tell you how long this task wants to wait, and i dont want to tell you on what condition it should be unblocked. Just ... do some stuff and let me alone. See ya." That's not an intelligent interface upon which the scheduler can do anything particularly intelligent (in fact, it's a very stupid interface upon which the scheduler can only do stupid things), and hence schedulers tend to implement sched_yield() quite arbitrarily and in a very scheduler-internals dependent way - which internals change when the scheduler is changed materially. The correct way to tell the kernel that the task is blocked is to use futexes for example, or any kernel-based locking or wait object - there are myriads of APIs for these. (The only well-defined behavior of yield is for SCHED_FIFO/RR tasks - and that is fully preserved in CFS.) Regarding compatible behavior: it's not possible to fully emulate the O(1) scheduler's yield behavior, because the yield implementation depended on so many scheduler internals. We changed yield when we went from 2.4 to 2.6, and even in 2.6 we changed it a number of times. To avoid the reoccuring problems of applications mistakenly relying on sched_yield(), we now context-switch on yield very weakly for SCHED_OTHER tasks (SCHED_FIFO/RR behavior is unchanged) - this is the only sane behavior that will get apps to stop using sched_yield() - and that's the difference that the above testcase shows. (there's actual user-space code executed, instead of the frequent overscheduling.) My patch below adds a sysctl flag that triggers a context-switch when yield is called (how futile and undefined and broken that might be), but that would only be for legacy reasons. We could still add t...
Certainly this is reasonable for applications for which the source is available and readily recompilable. However, there are legacy closed-source apps out there expecting sched_yield() to result in a reasonable amount of time passing before the task is scheduled again. Also, there are installed bases of people that may have older versions of code that may wish to upgrade to newer kernels without upgrading the rest of the system. It seems odd to force them to update userspace apps I've always understood one of the kernel's basic tenets to be that we don't break userspace without a good reason. If there are apps out there that expect sched_yield() to give up the cpu, it seems counter-intuitive to ignore that expectation. Personally, I'd be in favour of making the context-switch be the default behaviour, but at the very least it should be possible to enable a "backwards-compatibility mode" for sched_yield(). Chris -
I disagree. I think CFS made "sched_yield()" worse, and what you call "bug workaround" is likely the *better* behaviour. The fact is, sched_yield() is not - and should not be - about "recalculating the position in the scheduler queue" like you do now in CFS. It very much is about moving the thread *dead last* within its priority group. That's what it does for round-robin, and it's not about fairness, it's about - Opengroup: DESCRIPTION The sched_yield() function forces the running thread to relinquish the processor until it again becomes the head of its thread list. It takes no arguments. - Linux man-page: DESCRIPTION A process can relinquish the processor voluntarily without blocking by calling sched_yield. The process will then be moved to the end of the queue for its static priority and a new process gets to run. and quite frankly, the current CFS behaviour simply looks buggy. It should simply not move it to the "right place" in the rbtree. It should move it *last*. Linus -
ok, we can do that. the O(1) implementation of yield() was pretty arbitrary: it did not move it last on the same priority level - it only did it within the active array. So expired tasks (such as CPU hogs) would come _after_ a yield()-ing task. so the yield() implementation was so much tied to the data structures of the O(1) scheduler that it was impossible to fully emulate it in CFS. in CFS we dont have a per-nice-level rbtree, so we cannot move it dead last within the same priority group - but we can move it dead last in the whole tree. (then they'd be put even after nice +19 tasks.) People might complain about _that_. another practical problem is that this will break certain desktop apps that do calls to yield() [some firefox plugins do that, some 3D apps do that, etc.] but they dont expect to be moved 'very late' into the queue - they expect the O(1) scheduler's behavior of being delayed "a bit". (That's why i added the yield-granularity tunable.) we can make yield super-agressive, that is pretty much the only sane (because well-defined) thing to do (besides turning yield into a NOP), but there will be lots of regression reports about lost interactivity during load. sched_yield() is a mortally broken API. "fix the app" would be the answer, but still there will be lots of complaints. Ingo -
Yeah, with reasonably good reason. The sched_yield() behaviour is actually very well-defined for RT tasks (now, whether it's a good interface to use or not is still open to debate, but at least it's a _defined_ interface ;), and I think we should at least try to approximate that behaviour for normal tasks, even if they aren't RT. And the closest you can get is basically to say something that is close to "sched_yield()" on a non-RT process still means that all other runnable tasks in that same nice-level will be scheduled before the task is scheduled again. and I think that would be the optimal approximation of the RT behaviour. Now, quite understandably we might not be able to actually get that *exact* behaviour (since we mix up different nice levels), but I think from a QoI standpoint we should really have that as a "target" behaviour. So I think it should be moved back in the RB-tree, but really preferably only back to behind all other processes at the same nice-level. So I think we have two problems with yield(): - it really doesn't have very nice/good semantics in the first place for anything but strict round-robin RT tasks. We can't do much about this problem, apart from trying to make people use proper locking and other models that do *not* depend on yield(). - Linux has made the problem a bit worse by then having fairly arbitrary and different semantics over time. This is where I think it would be a good idea to try to have the above kind of "this is the ideal behaviour - we don't guarantee it being precise, but at least we'd have some definition of what people should be able to expect is the "best" behaviour. So I don't think a "globally last" option is at all the best thing, but I think it's likely better than what CFS does now. And if we then have some agreement on what would be considered a further _improvement_, then that would be a good thing - maybe we're not perfect, but at least having a vie...
i thought a bit about the statistical approach, and it's good in principle, but it has an implementational problem/complication: if there are only yielding tasks in the system, then the "queue rightwards in the tree, statistically" approach cycles through the key-space artificially fast. That can cause various problems. (this means that the workload-flag patch that uses yield_granularity is buggy as well. The queue-rightmost patch did not have this problem.) So right now there are only two viable options i think: either do the current weak thing, or do the rightmost thing. The statistical method might work too, but it needs more thought and more testing - i'm not sure we can get that ready for 2.6.23. So what we have as working code right now is the two extremes, and apps will really mostly prefer either the first (if they dont truly want to use yield but somehow it got into their code) or the second (if they want some massive delay). So while it does not have a good QoI, how about doing a compat_yield sysctl that allows the turning on of the "queue rightmost" logic? Find tested patch below. Peter, what do you think? Linus, if this would be acceptable for .23 then i'll push it out into sched.git together with another fix that Hiroshi Shimamoto just posted to lkml. Ingo --------------------------------------> Subject: sched: add /proc/sys/kernel/sched_compat_yield From: Ingo Molnar <mingo@elte.hu> add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield() more agressive, by moving the yielding task to the last position in the rbtree. with sched_compat_yield=0: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield 2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop with sched_compat_yield=1: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 ...
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 These are pure cpu scheduling tests, not doing any I/O this time. All these tests are still "pathological" in the sense that they are only meant to show differences between schedulers rather than try to simulate real usage scenarios. all the graphs are here: http://devloop.org.uk/lkml/ Legend: * 2.6.23-rc6-yield2: "yield2" patch is this one: http://lkml.org/lkml/2007/9/14/157 * 2.6.23-rc6-yield2-highlatency is the same patch, but the kernel is built without preempt, with HZ100 and the scheduling granularity is doubled using sysctl. * 2.6.23-rc6-yield3 is CFS-v21 combo3 + the yield patch * 2.6.23-rc6-yield4 is this patch (queued for mainline? and in mm?): http://lkml.org/lkml/2007/9/19/409 of interest I found: * rc6-mm1 is not always fair (see "ManyThreadsYieldOften" tests) - the only one to have almost half the threads already finished at the half way point when yielding often. Also slower for the "RandomSleep". * increasing latency makes a noticeable difference (see "ShortPause") it can be more fair, but it also makes it a lot more erratic (see "Yield" tests) * most changes are only noticeable with a large number for threads (look for 'ManyThreads' in the filename) Summary of the code: each thread loops 25 times, incrementing a shared counter each time and calling "iteration()" which does: * "Long Pause" tests: 1000 times sqrt() followed by 25 milliseconds sleep. * "Random Sleep" tests: sleep(n) after 1000 sqrt calculations, where n is a random number between 0 and 99 milliseconds. * "Short Pause" Tests: 1 million sqrt() followed by 5ms sleep. * "Yield Often" Tests: sqrt() then yield(), both 250 times per iteration. * "Yield" Tests: 1 million sqrt() followed by a single yield(). All the source code is here: http://bugs.devloop.org.uk/devloop/browser/metastores/examples-test/src/com/metastores... Antoine PS: now testing -rc8, also added a test that consumes memory in each thread. also now rec...
wrt. yield4 did you set /proc/sys/kernel/sched_compat_yield to 1? (with sched_compat_yield at 0, which is the default, nothing changes) i'm wondering, how easy would it be for you to test the sched-devel.git tree? If you havent used git before then first install the 'git' package, then do: git-clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6.git cd linux-2.6.git ok. Ingo -
I have to agree that for .23 we can't do much more than this. And tasks moving to the right without actually doing work and advancing fair_clock do scare me a little. Also, while I agree with Linus' definition of sched_yield, I'm afraid it will cause 'regressions' for all the interactivity people out there. Somehow this yield thing has made it into all sorts of unfortunate places like video drivers, so a heavy penalizing yield will hurt them. -
it's getting late here so i've pushed the current version of those two
patches out to:
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
I'll redo this git tree if we want some other solution for yield. (but i
think this is the safest approach for 2.6.23 - some apps will complain
about too strong yield, some apps will complain about too weak yield. So
by providing the two extremes we at least cover the practical range of
behavior.)
there's nothing else pending for 2.6.23 otherwise at the moment,
scheduler-wise.
Ingo
------------------>
Hiroshi Shimamoto (1):
sched: fix invalid sched_class use
Ingo Molnar (1):
sched: add /proc/sys/kernel/sched_compat_yield
include/linux/sched.h | 1
kernel/sched.c | 10 ++++---
kernel/sched_fair.c | 63 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sysctl.c | 8 ++++++
4 files changed, 72 insertions(+), 10 deletions(-)
-find below the fix that puts yielding tasks to the rightmost position in
the rbtree. I have not tested it extensively yet, but it appears to work
on x86. (i tested yield using interactive tasks and they get hurt now
under load - but this would be expected.)
Ingo
---------------------->
Subject: sched: make yield more agressive
From: Ingo Molnar <mingo@elte.hu>
make sys_sched_yield() more agressive, by moving the yielding task
to the last position in the rbtree.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched.c | 5 +----
kernel/sched_fair.c | 39 ++++++++++++++++++++++++++++++++++-----
2 files changed, 35 insertions(+), 9 deletions(-)
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4550,10 +4550,7 @@ asmlinkage long sys_sched_yield(void)
struct rq *rq = this_rq_lock();
schedstat_inc(rq, yld_cnt);
- if (unlikely(rq->nr_running == 1))
- schedstat_inc(rq, yld_act_empty);
- else
- current->sched_class->yield_task(rq, current);
+ current->sched_class->yield_task(rq, current);
/*
* Since we are going to call schedule() anyway, there's
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -902,14 +902,43 @@ static void dequeue_task_fair(struct rq
static void yield_task_fair(struct rq *rq, struct task_struct *p)
{
struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
+ struct sched_entity *rightmost, *se = &p->se;
+ struct rb_node *parent;
- __update_rq_clock(rq);
/*
- * Dequeue and enqueue the task to update its
- * position within the tree:
+ * Are we the only task in the tree?
*/
- dequeue_entity(cfs_rq, &p->se, 0);
- enqueue_entity(cfs_rq, &p->se, 0);
+ if (unlikely(cfs_rq->n...This seems right. They're both always ready to run. They're at the same
priority. Neither ever blocks. There is no reason one should get more CPU
Nonsense. The task is always ready-to-run. There is no reason its CPU should
be low. This bug report is based on a misunderstanding of what yielding
means.
The Linux page says:
"A process can relinquish the processor voluntarily without blocking
by
calling sched_yield(). The process will then be moved to the end
of
the queue for its static priority and a new process gets to run."
Notice the "without blocking" part?
POSIX says:
"The sched_yield() function forces the running thread to relinquish the
processor until it again becomes the head of its thread list. It takes no
arguments."
CFS is perfectly complying with both of these. This bug report is a great
example of how sched_yield can be misunderstood and misused.
You can even argue that the sched_yield process should get even more CPU,
since it's voluntarily relinquishing (which should be rewarded) rather than
infinitely spinning (which should be punished). (Not that I agree with this
argument, I'm just using it to counter-balance the other argument.)
DS
-The yielding task has given up the cpu. The other task should get to run for a timeslice (or whatever the equivalent is in CFS) until the yielding task again "becomes head of the thread list". Chris -
Are you sure this isn't happening? I'll run some tests on my SMP system running CFS. But I'll bet the context switch rate is quite rapid. Honestly, I can't imagine what else could be happening here. Does CFS spin in a loop doing nothing when you call sched_yield even though another task is ready-to-run? That seems kind of bizarre. Is sched_yield acting as a no-op? DS -
Yep, that's exactly what's happening. The tasks are alternating. They are both always ready-to-run. The yielding task is put at the end of the queue for its priority level. There is no reason the yielding task should get less CPU since they're both always ready-to-run. The only downside here is that a yielding task results in very small timeslices which causes cache inefficiencies. A sane lower bound on the timeslice might be a good idea. But there is no semantic problem. DS -
Ack, sorry, I'm wrong. Please ignore me, if you weren't already. I'm glad to hear this will be fixed. The task should be moved last for its priority level. DS -
I've tried reasonalby diligently to figure out what the hell you're doing and gone through quite a bit of your documentation, and I just can't figure it out. This could entirely be the result of your test's sensitivity to execution order. For example, if you run ten threads that all insert, query, and delete from the *same* table, then the exact interleaving pattern will determine the size of the results. A slight change in the scheduling quantum could multiply the size of the result data by a huge factor. There is a big difference between: 1) Thread A inserts data. 2) Thread A queries data. 3) Thread A deletes data. 4) Thread B inserts data. ... and 1) Thread A inserts data. 2) Thread B insers data. ... 101) Thread A queries data. 102) Thread B queries data. ... There are a number of ways you might be measuring nothing but how the scheduler chooses to interleave your threads. Benchmarking threads that yield suggests just this type of thing -- if a thread has useful work to do and another thread is not going to help it, *why* *yield*? Are you worried the scheduler isn't going to schedule other threads?! Or is there some sane reason to force suboptimal scheduling when you're trying to benchmark a scheduler? Are you trying to see how it deals with pathological patterns? ;) The only documentation I can see about what you're actually *doing* says things like "The schema and statements are almost identical to the non-threaded tests." Do you see why that's not helpful? DS -
(cc's readded please reply to all when replying to lkml) Hi David, You might be sounding a bit too abrasive here... I understand you're also trying to help, but your tone just might be taken the wrong way. Antonie is really doing the right thing here to test such a new feature early and on the code he cares about as a user. And most importantly, reporting it here. This is probably the most useful resource we have in Linux. Maybe the workload is quirky, but regardless, if it is a *regression* from a previous kernel then it is really important to be brought to our -
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 I've tested a couple more kernels: 2.6.23-rc4-mm1 and 2.6.23-rc6 with the "sched_yield_bug_workaround" patch from Ingo, results are here: http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThr... Same, with the load average (dotted lines): http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThr... You are correct, it is very sensitive to the execution order and caching. When I vary the thread pause, the total execution time varies widely. 10ms just happens to be the sweet spot which provides the best As I said above, I have tried varying the pause and this is the sweet spot. If you don't yield, the first hundred threads will be finished by the time you start the 800th - which is not what you want. I could try just yielding at the start, or only during the first I know it sounds sub-optimal, but this benchmark wasn't designed to test the scheduler! It is meant to thrash just one database table. It isn't meant to be anything like TPC either. Previously it did uncover some very noticeable differences in JVM I sure do! I'll try to improve on that when I get a chance. Here is a start: the schema is configurable with simple text files. And the database layer translates it into the SQL syntax (and types) supported by the backend (MySQL in this case): # cat ManyThreadedCombinedTest.properties table=update columns=i1:integer,i2:integer,s1:varchar,s2:varchar I'll make a click&run tarball if anyone is interested in running the tests themselves (it outputs csv data). Thanks for the feedback. Antoine -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFG6Yk8GK2zHPGK1rsRCtVaAKCAVyU4woztnl6q0NAZBNsM94I2JgCcD4M3 +MpR1gAom65Mn/tQX8ig1cI= =Q3IY -----END PGP SIGNATURE----- -
First, let me apologize if the tone of my other post came through as angry or frustrated. Text can sometimes be misleading as to tone. I certainly wasn't angry. Second, let me say that I'm definitely not suggesting that you were wrong to bring this to everyone's attention. Even if it turns out that your code is horribly broken and it's all your fault, any apparent regression still has to be investigated. It is virtually certain that we will learn something interesting about either your code, CFS, or both. I was definitely *not* saying "how dare you challenge CFS' supremacy without a perfect test case". The problem is that because your access pattern is pathological, schedulers that are objectively worse may turn in much better performance. For example, a scheduler that completely ignores your attempt to sleep will probably perform significantly better than a scheduler that goes out of its way to honor it. That the execution time is very sensitive to the pause is strong evidence of this. The problem is simply that your test program doesn't do a fixed amount of work. It does a variable amount of work, and that amount depends upon scheduler details. It's like a job that has fifty threads that do this: 1) Increment a shared variable. 2) Do a math problem a number of times equal to the value of that shared variable. 3) Decrement the shared variable. The problem is that the number of times the math problem is done depends upon the execution order of the threads. To be fair, you would need to benchmark how many times the math problem gets done, not how long the threads take to complete. Now, imagine if I insert a yield between steps 1 and 2. The more a scheduler honors your yield request, the worse it will appear to perform. The scheduler that ignores it (which is legal, but definitely not The Right Thing) will seem to perform *much* better. Of course, it's also possible that this is not what's going on. DS -
| Linus Torvalds | Linux 2.6.21-rc5 |
| Linus Torvalds | Linux 2.6.26-rc4 |
| Christoph Hellwig | Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel |
| Bryan Woods | Stardom SATA HSM violation |
git: | |
| Linus Torvalds | People unaware of the importance of "git gc"? |
| Jan Holesovsky | [PATCH] RFC: git lazy clone proof-of-concept |
| Linus Torvalds | cleaner/better zlib sources? |
| martin f krafft | Re: Track /etc directory using Git |
| Chris | OpenBSD 4.4 installation error: write failed; file system full |
| Brian A. Seklecki | sshd_config(5) PermitRootLogin yes |
| steve szmidt | Re: The Atheros story in much fewer words |
| David Newman | setting dscp or tos bits |
| Jim Winstead Jr. | Re: Root Disk/Book Disk Compatibility |
| Jan Nicolai Langfeldt | Re: Hypenation problems under LaTeX. |
| Linus Torvalds | Re: Missing linux/delay.h??? |
| Stew Ellis | Trouble with minicom scripts |
