RE: CFS: some bad numbers with Java/database threading [FIXED]

Previous thread: Re: [linux-dvb] [PATCH] Userspace tuner by Markus Rechberger on Wednesday, September 12, 2007 - 7:10 pm. (56 messages)

Next thread: [RFC PATCH] Add a 'minimal tree install' target by Chris Wedgwood on Wednesday, September 12, 2007 - 7:25 pm. (12 messages)
To: Linux Kernel Development <linux-kernel@...>
Date: Wednesday, September 12, 2007 - 7:10 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Hi list,

I was working on some unit tests and thought I'd give CFS a whirl to see
if it had any impact on my workloads (to see what the fuss was about),
and I came up with some pretty disturbing numbers:
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/K...
As above but also showing the load average:
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/K...
Looks like a regression to me...

Basically, all the previous kernels are pretty close (2.6.16 through to
2.6.20 performed almost identically to 2.6.22 and are not shown here to
avoid cluttering the graphs)

All the 2.6.23-rc kernels performed poorly (except -rc3!): much more
erratically and with a sharp performance drop above 800 threads. The
load starts to go up and the performance takes a nosedive.

With fewer threads (less than 50) there is hardly any difference at all
between all the kernels.

Notes about the tests and setup:
* environment is:
Dual Opteron 252 with 3GB ram, scsi disk, etc..
Sun Java 1.6
MySQL 5.0.44
Junit + ant + my test code (devloop.org.uk)
* java threads are created first and the data is prepared, then all the
threads are started in a tight loop. Each thread runs multiple queries
with a 10ms pause (to allow the other threads to get scheduled)
* load average is divided by the number of cpus (2)
* more general information (which also covers some irrelevant
information about some other tests I have published) is here:
http://devloop.org.uk/documentation/database-performance/Setup/

Don't shoot the messenger!
I can run some more tests if needed (bearing in mind that a full test
run takes a few hours) or you can run the tests yourself: instructions
on running the tests are included.
I am now re-testing with sched-cfs-v2.6.23-rc6-v21-combo-2.patch
but feel free to send some other patches.

Antoine
-----BEGIN PGP SIGNATURE-----
Versio...

To: Antoine Martin <antoine@...>
Cc: Linux Kernel Development <linux-kernel@...>
Date: Thursday, September 13, 2007 - 7:24 am

hm, could you try the patch below ontop of 2.6.23-rc6 and do:

echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

does this improve the numbers?

Ingo

-------------->
Subject: sched: yield debugging
From: Ingo Molnar <mingo@elte.hu>

introduce various sched_yield implementations:

# default one:
echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

# always queues the current task next to the next task:
echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

# NOP:
echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

tunability depends on CONFIG_SCHED_DEBUG=y.

Not-yet-signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/sched.h | 2 +
kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sysctl.c | 19 +++++++++++++
3 files changed, 88 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1392,10 +1392,12 @@ extern void sched_idle_next(void);
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_latency;
extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_yield_granularity;
extern unsigned int sysctl_sched_wakeup_granularity;
extern unsigned int sysctl_sched_batch_wakeup_granularity;
extern unsigned int sysctl_sched_stat_granularity;
extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_yield_bug_workaround;
extern unsigned int sysctl_sched_child_runs_first;
extern unsigned int sysctl_sched_features;
#endif
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -42,6 +42,16 @@ const_debug unsigned int sysctl_sched_la
*/
const_debug unsigned int sysctl_sched_child_runs_first = 1;

+const_debug unsigned int sysctl_sched_yiel...

To: Antoine Martin <antoine@...>
Cc: Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Friday, September 14, 2007 - 4:32 am

the patch i sent was against CFS-devel. Could you try the one below,
which is against vanilla -rc6, does it improve the numbers? (it should
have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the
sysctl.

Ingo

------------------------>
Subject: sched: yield debugging
From: Ingo Molnar <mingo@elte.hu>

introduce various sched_yield implementations:

# default one:
echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

# always queues the current task next to the next task:
echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

# NOP:
echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

tunability depends on CONFIG_SCHED_DEBUG=y.

Not-yet-signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/sched.h | 2 +
kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sysctl.c | 19 +++++++++++++
3 files changed, 88 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1402,10 +1402,12 @@ extern void sched_idle_next(void);

extern unsigned int sysctl_sched_latency;
extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_yield_granularity;
extern unsigned int sysctl_sched_wakeup_granularity;
extern unsigned int sysctl_sched_batch_wakeup_granularity;
extern unsigned int sysctl_sched_stat_granularity;
extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_yield_bug_workaround;
extern unsigned int sysctl_sched_child_runs_first;
extern unsigned int sysctl_sched_features;

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -42,6 +42,16 @@ unsigned int sysctl_sched_latency __read
*/
unsigned int sysctl_sched_min_granularity __read_mostly = 2000...

To: Ingo Molnar <mingo@...>
Cc: Antoine Martin <antoine@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Friday, September 14, 2007 - 6:06 am

Hi Antoine, Ingo,

Hmm, I know diddly about Java, and I don't want to preempt Antoine's next
test, but I noticed that he uses Thread.sleep() in his testcode and not the
Thread.yield() so it would be interesting if Antoine can test with this patch

This is an interesting data point, IMHO ... considering these tests are long,
I suspect you ran them only once each per kernel. So I wonder how reliable
that -rc3 testpoint is. If this oddity is reproducible, it would be great if you
could git-bisect:

1. between 23-rc1 and 23-rc3, and find out which commit led to the
improvement in performance, and,
2. between 23-rc3 and 23-rc6, and find out which commit brought down
the numbers again.

[ http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html,

Don't know much about CFS either, but does that constant "10 ms" sleep
somehow lead to evil synchronization issues between the test threads?

Thanks,

Satyam
-

To: Antoine Martin <antoine@...>
Cc: Ingo Molnar <mingo@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>, Nick Piggin <nickpiggin@...>, David Schwartz <davids@...>
Date: Friday, September 14, 2007 - 12:01 pm

[ Argh, just noticed this thread got broke and had been living a parallel
life due to Subject: changes, dropped Cc:'s, and munged In-Reply-To:'s.

Ok, it appears Antoine tested with the patch and got:

http://devloop.org.uk/documentation/database-performance/Linux-Kernels/K...
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/K...

which leads me to believe this probably wasn't a yield problem after all, though
it would still be useful if someone with more knowledge of Java could give that
code a look over ...

Curiously, the -rc3 oddity is still plainly visible there -- how do we
explain that?
Ingo, does that oddity (and the commits that went in around -rc3 time) give some

Ok, it's reproducible, making our job easier. Which also means, Antoine,

I don't have access to any real/meaningful SMP systems, so I wonder how much
sense it makes in practical terms for me to try and run these tests
locally on my
little boxen ... would be helpful if someone with 4/8 CPU systems could give

Umm, you mention _changing_ this value earlier, but it still remains the same
for every thread during every loop for a given test run -- what I'm
suggesting is
making that code do something like: Thread.sleep(random(x, y)); where
x=2, y=20 and random(x, y) returns a random integer between x and y, so all
threads sleep for different durations in every loop, but still average
out to about
~10 ms over a period. Try to vary x and y (to vary the average) and post the
resulting graphs too? CONFIG_HZ (actually, full .config) and dmesg might be
useful for us as well.

Also, like David mentioned, counting the _number_ of times the test threads
managed to execute those SQL queries is probably a better benchmark than
measuring the time it takes for threads to finish execution itself -- uniformity
in that number across threads would bring out how "fair" CFS is compared to
previous kernels, for ...

To: Satyam Sharma <satyam.sharma@...>
Cc: Ingo Molnar <mingo@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>, Nick Piggin <nickpiggin@...>, David Schwartz <davids@...>
Date: Monday, September 17, 2007 - 8:17 am

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

It should still work... if you're patient.

http://devloop.org.uk/documentation/database-performance/dmesg-dualopter...
Good idea, I've added code to capture more fairness oriented attributes:
* time until the first thread finishes - this should actually be higher
when the scheduler is fairer.
* total number of loop iterations executed when the first thread
finishes (higher is better)
I think he does have a valid point:
maybe, the locality of the java threads is important, but there are so
many of them and so many layers below it (jdbc pool, backend database
threads, filesystem, device) that the various caches get thrashed quite
often anyway, no matter what the interleaving factor is.
Which means that being fairer in this case also means switching threads
more often and hitting the caches' capacity limits earlier.
In which case the granularity and CONFIG_HZ should have a noticeable
impact. Ingo did suggest tweaking
/proc/sys/kernel/sched_yield_granularity_ns, which I did but this
particular test timed out when It ran over the weekend... will run it
again now. (results in a couple of days)

Maybe the granularity should auto-adjust to prevent such cases?
(probably not)
Granularity (and therefore fairness) become less important as the number
of threads becomes very large, as fairness starts to adversely affects
throughput?

I have changed my mind and now think that this is not a regression, or
at least not one we should worry too much about. As David said, this
workload is "pathological".

And CFS, does show some noticeable improvements with the new
measurements (now using a shorter thread yield of 5ms and a higher
number of loop iterations per thread: 25):
It does not adversely affect throughput as much (will test with more
threads later):
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/K...
The number of threads that exit before we reach half the total number ...

To: Antoine Martin <antoine@...>
Cc: Ingo Molnar <mingo@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>, Nick Piggin <nickpiggin@...>, David Schwartz <davids@...>
Date: Friday, September 14, 2007 - 12:08 pm

[snip]

Ick, threading breaks in Gmail with Subject: changes, so I missed the latest
updates on this thread ... oh well, never mind.

Satyam
-

To: Satyam Sharma <satyam.sharma@...>
Cc: Ingo Molnar <mingo@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Friday, September 14, 2007 - 11:25 am

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

See below...
It looks good now! Updated results here:
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/K...
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/K...
Compared with more kernels here - a bit more cluttered:
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/K...

Thanks Ingo!
Does this mean that I'll have to keep doing:
echo 1 > /proc/sys/kernel/sched_yield_bug_workaround
Or are you planning on finding a more elegant solution?
# find /proc -name "*workaround*"
/proc/sys/kernel/sched_yield_bug_workaround
Yeah, I thought that was quite suspicious.
- -rc2 is just like -rc1 (see above) so I'll re-test -rc3 first,
git-bisect could take a while with those tests... just wiping the disk
I've tested before with varying timings, but I had not thought of using
a randomized delay.
Will add that too.

Many thanks to you all for the feedback!
Antoine
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG6qfkGK2zHPGK1rsRCgeEAJ9HUUtHUNScvTVKo5z2sSmo+G+BVgCfdYmK
rcd1VYUuzQA2oFEmakjZxgM=
=jmI8
-----END PGP SIGNATURE-----
-

To: Antoine Martin <antoine@...>
Cc: Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Friday, September 14, 2007 - 11:32 am

just to make sure - can you get it to work fast with the
-rc6+yield-patch solution too? (i.e. not CFS-devel) We need a (tested)
solution for 2.6.23 and the CFS-devel patches are not for 2.6.23. I've
attached below the latest version of the -rc6 yield patch - the switch
is not dependent on SCHED_DEBUG anymore but always available.

Ingo

------------------->
Subject: sched: yield workaround
From: Ingo Molnar <mingo@elte.hu>

sched_yield() is fundamentally broken, and CFS has changed
its behavior. Some apps that mistakenly rely on sched_yield()
want "weak" sched yield (such as desktop apps) - while some
apps want "strong" sched_yield() (for example some JDKs).

There's no way for the scheduler to figure out which of the
two variants the app really wants - because sched_yield() is
all about hiding from the kernel the true structure of the
user-space locking code.

As a solution, provide a workaround, to introduce a more
agressive sched_yield implementation:

# default one:
echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

# always queues the current task next to the next task:
echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

# NOP:
echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

in the future, the use of this sysctl might generate
a deprecation warning, so that apps start moving away
from their reliance on sched_yield().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/sched.h | 2 +
kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sysctl.c | 19 +++++++++++++
3 files changed, 88 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1402,10 +1402,12 @@ extern void sched_idle_next(void);

extern unsigned int sysctl_sched_latency;
extern unsigned int sysctl_sched_min_granularity;
+extern unsigned ...

To: Ingo Molnar <mingo@...>
Cc: Antoine Martin <antoine@...>, Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Tuesday, September 18, 2007 - 1:00 pm

Is this going to be merged? And will you be making the default == 1 or
just leaving it at 0, which forces people who want the older behavior to
modify the default?

-

To: Chuck Ebbert <cebbert@...>
Cc: Antoine Martin <antoine@...>, Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Tuesday, September 18, 2007 - 6:46 pm

not at the moment - Antoine suggested that the workload is probably fine
and the patch against -rc6 would have no clear effect anyway so we have
nothing to merge right now. (Note that there's no "older behavior"
possible, unless we want to emulate all of the O(1) scheduler's
behavior.) But ... we could still merge something like that patch, but a
clearer testcase is needed. The JVM's i have access to work fine.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Antoine Martin <antoine@...>, Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Tuesday, September 18, 2007 - 7:02 pm

I just got a bug report today:

https://bugzilla.redhat.com/show_bug.cgi?id=295071

==================================================

Description of problem:

The CFS scheduler does not seem to implement sched_yield correctly. If one
program loops with a sched_yield and another program prints out timing
information in a loop. You will see that if both are taskset to the same core
that the timing stats will be twice as long as when they are on different cores.
This problem was not in 2.6.21-1.3194 but showed up in 2.6.22.4-65 and continues
in the newest released kernel 2.6.22.5-76.

Version-Release number of selected component (if applicable):

2.6.22.4-65 through 2.6.22.5-76

How reproducible:

Very

Steps to Reproduce:
compile task1
int main() {
while (1) {
sched_yield();
}
return 0;
}

and compile task2

#include <stdio.h>
#include <sys/time.h>
int main() {
while (1) {
int i;
struct timeval t0,t1;
double usec;

gettimeofday(&t0, 0);
for (i = 0; i < 100000000; ++i)
;
gettimeofday(&t1, 0);

usec = (t1.tv_sec * 1e6 + t1.tv_usec) - (t0.tv_sec * 1e6 + t0.tv_usec);
printf ("%8.0f\n", usec);
}
return 0;
}

Then run:
"taskset -c 0 ./task1"
"taskset -c 0 ./task2"

You will see that both tasks use 50% of the CPU.
Then kill task2 and run:
"taskset -c 1 ./task2"

Now task2 will run twice as fast verifying that it is not some anomaly with the
way top calculates CPU usage with sched_yield.

Actual results:
Tasks with sched_yield do not yield like they are suppose to.

Expected results:
The sched_yield task's CPU usage should go to near 0% when another task is on
the same CPU.
-

To: Chuck Ebbert <cebbert@...>
Cc: Antoine Martin <antoine@...>, Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>
Date: Wednesday, September 19, 2007 - 3:18 pm

sched_yield is a very poorly defined interface as it does not tell the
kernel anything about what kind of locking user-space does. When an app
calls the syscall it basically tells the scheduler: "uhm, ok, i'm
blocked and i want you to do something else now. I dont want to tell you
how long this task wants to wait, and i dont want to tell you on what
condition it should be unblocked. Just ... do some stuff and let me
alone. See ya."

That's not an intelligent interface upon which the scheduler can do
anything particularly intelligent (in fact, it's a very stupid interface
upon which the scheduler can only do stupid things), and hence
schedulers tend to implement sched_yield() quite arbitrarily and in a
very scheduler-internals dependent way - which internals change when the
scheduler is changed materially.

The correct way to tell the kernel that the task is blocked is to use
futexes for example, or any kernel-based locking or wait object - there
are myriads of APIs for these. (The only well-defined behavior of yield
is for SCHED_FIFO/RR tasks - and that is fully preserved in CFS.)

Regarding compatible behavior: it's not possible to fully emulate the
O(1) scheduler's yield behavior, because the yield implementation
depended on so many scheduler internals. We changed yield when we went
from 2.4 to 2.6, and even in 2.6 we changed it a number of times. To
avoid the reoccuring problems of applications mistakenly relying on
sched_yield(), we now context-switch on yield very weakly for
SCHED_OTHER tasks (SCHED_FIFO/RR behavior is unchanged) - this is the
only sane behavior that will get apps to stop using sched_yield() - and
that's the difference that the above testcase shows. (there's actual
user-space code executed, instead of the frequent overscheduling.)

My patch below adds a sysctl flag that triggers a context-switch when
yield is called (how futile and undefined and broken that might be), but
that would only be for legacy reasons. We could still add t...

To: Ingo Molnar <mingo@...>
Cc: Chuck Ebbert <cebbert@...>, Antoine Martin <antoine@...>, Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>
Date: Wednesday, September 19, 2007 - 4:00 pm

Certainly this is reasonable for applications for which the source is
available and readily recompilable. However, there are legacy
closed-source apps out there expecting sched_yield() to result in a
reasonable amount of time passing before the task is scheduled again.

Also, there are installed bases of people that may have older versions
of code that may wish to upgrade to newer kernels without upgrading the
rest of the system. It seems odd to force them to update userspace apps

I've always understood one of the kernel's basic tenets to be that we
don't break userspace without a good reason. If there are apps out
there that expect sched_yield() to give up the cpu, it seems
counter-intuitive to ignore that expectation.

Personally, I'd be in favour of making the context-switch be the default
behaviour, but at the very least it should be possible to enable a
"backwards-compatibility mode" for sched_yield().

Chris
-

To: Ingo Molnar <mingo@...>
Cc: Chuck Ebbert <cebbert@...>, Antoine Martin <antoine@...>, Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, September 19, 2007 - 3:39 pm

I disagree. I think CFS made "sched_yield()" worse, and what you call "bug
workaround" is likely the *better* behaviour.

The fact is, sched_yield() is not - and should not be - about
"recalculating the position in the scheduler queue" like you do now in
CFS.

It very much is about moving the thread *dead last* within its priority
group.

That's what it does for round-robin, and it's not about fairness, it's
about

- Opengroup:

DESCRIPTION

The sched_yield() function forces the running thread to
relinquish the processor until it again becomes the head of its
thread list. It takes no arguments.

- Linux man-page:

DESCRIPTION

A process can relinquish the processor voluntarily without
blocking by calling sched_yield. The process will then be moved
to the end of the queue for its static priority and a new process
gets to run.

and quite frankly, the current CFS behaviour simply looks buggy. It should
simply not move it to the "right place" in the rbtree. It should move it
*last*.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Chuck Ebbert <cebbert@...>, Antoine Martin <antoine@...>, Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, September 19, 2007 - 3:56 pm

ok, we can do that.

the O(1) implementation of yield() was pretty arbitrary: it did not move
it last on the same priority level - it only did it within the active
array. So expired tasks (such as CPU hogs) would come _after_ a
yield()-ing task.

so the yield() implementation was so much tied to the data structures of
the O(1) scheduler that it was impossible to fully emulate it in CFS.

in CFS we dont have a per-nice-level rbtree, so we cannot move it dead
last within the same priority group - but we can move it dead last in
the whole tree. (then they'd be put even after nice +19 tasks.) People
might complain about _that_.

another practical problem is that this will break certain desktop apps
that do calls to yield() [some firefox plugins do that, some 3D apps do
that, etc.] but they dont expect to be moved 'very late' into the queue
- they expect the O(1) scheduler's behavior of being delayed "a bit".
(That's why i added the yield-granularity tunable.)

we can make yield super-agressive, that is pretty much the only sane
(because well-defined) thing to do (besides turning yield into a NOP),
but there will be lots of regression reports about lost interactivity
during load. sched_yield() is a mortally broken API. "fix the app" would
be the answer, but still there will be lots of complaints.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Chuck Ebbert <cebbert@...>, Antoine Martin <antoine@...>, Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, September 19, 2007 - 4:28 pm

Yeah, with reasonably good reason.

The sched_yield() behaviour is actually very well-defined for RT tasks
(now, whether it's a good interface to use or not is still open to debate,
but at least it's a _defined_ interface ;), and I think we should at least
try to approximate that behaviour for normal tasks, even if they aren't
RT.

And the closest you can get is basically to say something that is close to

"sched_yield()" on a non-RT process still means that all
other runnable tasks in that same nice-level will be scheduled
before the task is scheduled again.

and I think that would be the optimal approximation of the RT behaviour.

Now, quite understandably we might not be able to actually get that
*exact* behaviour (since we mix up different nice levels), but I think
from a QoI standpoint we should really have that as a "target" behaviour.

So I think it should be moved back in the RB-tree, but really preferably
only back to behind all other processes at the same nice-level.

So I think we have two problems with yield():

- it really doesn't have very nice/good semantics in the first place for
anything but strict round-robin RT tasks.

We can't do much about this problem, apart from trying to make people
use proper locking and other models that do *not* depend on yield().

- Linux has made the problem a bit worse by then having fairly arbitrary
and different semantics over time.

This is where I think it would be a good idea to try to have the above
kind of "this is the ideal behaviour - we don't guarantee it being
precise, but at least we'd have some definition of what people should
be able to expect is the "best" behaviour.

So I don't think a "globally last" option is at all the best thing, but I
think it's likely better than what CFS does now. And if we then have some
agreement on what would be considered a further _improvement_, then that
would be a good thing - maybe we're not perfect, but at least having a
vie...

To: Linus Torvalds <torvalds@...>
Cc: Chuck Ebbert <cebbert@...>, Antoine Martin <antoine@...>, Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, September 19, 2007 - 5:41 pm

i thought a bit about the statistical approach, and it's good in
principle, but it has an implementational problem/complication: if there
are only yielding tasks in the system, then the "queue rightwards in the
tree, statistically" approach cycles through the key-space artificially
fast. That can cause various problems. (this means that the
workload-flag patch that uses yield_granularity is buggy as well. The
queue-rightmost patch did not have this problem.)

So right now there are only two viable options i think: either do the
current weak thing, or do the rightmost thing. The statistical method
might work too, but it needs more thought and more testing - i'm not
sure we can get that ready for 2.6.23.

So what we have as working code right now is the two extremes, and apps
will really mostly prefer either the first (if they dont truly want to
use yield but somehow it got into their code) or the second (if they
want some massive delay). So while it does not have a good QoI, how
about doing a compat_yield sysctl that allows the turning on of the
"queue rightmost" logic? Find tested patch below.

Peter, what do you think?

Linus, if this would be acceptable for .23 then i'll push it out into
sched.git together with another fix that Hiroshi Shimamoto just posted
to lkml.

Ingo

-------------------------------------->
Subject: sched: add /proc/sys/kernel/sched_compat_yield
From: Ingo Molnar <mingo@elte.hu>

add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield()
more agressive, by moving the yielding task to the last position
in the rbtree.

with sched_compat_yield=0:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield
2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop

with sched_compat_yield=1:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 ...

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Chuck Ebbert <cebbert@...>, Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Tuesday, September 25, 2007 - 9:46 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

These are pure cpu scheduling tests, not doing any I/O this time.
All these tests are still "pathological" in the sense that they are only
meant to show differences between schedulers rather than try to simulate
real usage scenarios.

all the graphs are here:
http://devloop.org.uk/lkml/

Legend:
* 2.6.23-rc6-yield2: "yield2" patch is this one:
http://lkml.org/lkml/2007/9/14/157
* 2.6.23-rc6-yield2-highlatency is the same patch, but the kernel is
built without preempt, with HZ100 and the scheduling granularity is
doubled using sysctl.
* 2.6.23-rc6-yield3 is CFS-v21 combo3 + the yield patch
* 2.6.23-rc6-yield4 is this patch (queued for mainline? and in mm?):
http://lkml.org/lkml/2007/9/19/409

of interest I found:
* rc6-mm1 is not always fair (see "ManyThreadsYieldOften" tests) - the
only one to have almost half the threads already finished at the half
way point when yielding often. Also slower for the "RandomSleep".
* increasing latency makes a noticeable difference (see "ShortPause")
it can be more fair, but it also makes it a lot more erratic (see
"Yield" tests)
* most changes are only noticeable with a large number for threads (look
for 'ManyThreads' in the filename)

Summary of the code: each thread loops 25 times, incrementing a shared
counter each time and calling "iteration()" which does:
* "Long Pause" tests:
1000 times sqrt() followed by 25 milliseconds sleep.
* "Random Sleep" tests:
sleep(n) after 1000 sqrt calculations, where n is a random number
between 0 and 99 milliseconds.
* "Short Pause" Tests:
1 million sqrt() followed by 5ms sleep.
* "Yield Often" Tests:
sqrt() then yield(), both 250 times per iteration.
* "Yield" Tests:
1 million sqrt() followed by a single yield().
All the source code is here:
http://bugs.devloop.org.uk/devloop/browser/metastores/examples-test/src/...

Antoine

PS: now testing -rc8, also added a test that consumes memory in each
thread. also now rec...

To: Antoine Martin <antoine@...>
Cc: Linus Torvalds <torvalds@...>, Chuck Ebbert <cebbert@...>, Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Thursday, September 27, 2007 - 4:35 am

wrt. yield4 did you set /proc/sys/kernel/sched_compat_yield to 1?
(with sched_compat_yield at 0, which is the default, nothing
changes)

i'm wondering, how easy would it be for you to test the sched-devel.git
tree? If you havent used git before then first install the 'git'
package, then do:

git-clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6.git
cd linux-2.6.git

ok.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Chuck Ebbert <cebbert@...>, Antoine Martin <antoine@...>, Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>
Date: Wednesday, September 19, 2007 - 5:58 pm

I have to agree that for .23 we can't do much more than this. And tasks
moving to the right without actually doing work and advancing
fair_clock do scare me a little.

Also, while I agree with Linus' definition of sched_yield, I'm afraid
it will cause 'regressions' for all the interactivity people out there.
Somehow this yield thing has made it into all sorts of unfortunate
places like video drivers, so a heavy penalizing yield will hurt them.

-

To: Linus Torvalds <torvalds@...>
Cc: Chuck Ebbert <cebbert@...>, Antoine Martin <antoine@...>, Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, September 19, 2007 - 5:49 pm

it's getting late here so i've pushed the current version of those two
patches out to:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

I'll redo this git tree if we want some other solution for yield. (but i
think this is the safest approach for 2.6.23 - some apps will complain
about too strong yield, some apps will complain about too weak yield. So
by providing the two extremes we at least cover the practical range of
behavior.)

there's nothing else pending for 2.6.23 otherwise at the moment,
scheduler-wise.

Ingo

------------------>
Hiroshi Shimamoto (1):
sched: fix invalid sched_class use

Ingo Molnar (1):
sched: add /proc/sys/kernel/sched_compat_yield

include/linux/sched.h | 1
kernel/sched.c | 10 ++++---
kernel/sched_fair.c | 63 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sysctl.c | 8 ++++++
4 files changed, 72 insertions(+), 10 deletions(-)
-

To: Linus Torvalds <torvalds@...>
Cc: Chuck Ebbert <cebbert@...>, Antoine Martin <antoine@...>, Satyam Sharma <satyam.sharma@...>, Linux Kernel Development <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, September 19, 2007 - 4:26 pm

find below the fix that puts yielding tasks to the rightmost position in
the rbtree. I have not tested it extensively yet, but it appears to work
on x86. (i tested yield using interactive tasks and they get hurt now
under load - but this would be expected.)

Ingo

---------------------->
Subject: sched: make yield more agressive
From: Ingo Molnar <mingo@elte.hu>

make sys_sched_yield() more agressive, by moving the yielding task
to the last position in the rbtree.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched.c | 5 +----
kernel/sched_fair.c | 39 ++++++++++++++++++++++++++++++++++-----
2 files changed, 35 insertions(+), 9 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4550,10 +4550,7 @@ asmlinkage long sys_sched_yield(void)
struct rq *rq = this_rq_lock();

schedstat_inc(rq, yld_cnt);
- if (unlikely(rq->nr_running == 1))
- schedstat_inc(rq, yld_act_empty);
- else
- current->sched_class->yield_task(rq, current);
+ current->sched_class->yield_task(rq, current);

/*
* Since we are going to call schedule() anyway, there's
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -902,14 +902,43 @@ static void dequeue_task_fair(struct rq
static void yield_task_fair(struct rq *rq, struct task_struct *p)
{
struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
+ struct sched_entity *rightmost, *se = &p->se;
+ struct rb_node *parent;

- __update_rq_clock(rq);
/*
- * Dequeue and enqueue the task to update its
- * position within the tree:
+ * Are we the only task in the tree?
*/
- dequeue_entity(cfs_rq, &p->se, 0);
- enqueue_entity(cfs_rq, &p->se, 0);
+ if (unlikely(cfs_rq->n...

To: Linux-Kernel@Vger. Kernel. Org <linux-kernel@...>
Date: Wednesday, September 19, 2007 - 2:45 pm

This seems right. They're both always ready to run. They're at the same
priority. Neither ever blocks. There is no reason one should get more CPU

Nonsense. The task is always ready-to-run. There is no reason its CPU should
be low. This bug report is based on a misunderstanding of what yielding
means.

The Linux page says:

"A process can relinquish the processor voluntarily without blocking
by
calling sched_yield(). The process will then be moved to the end
of
the queue for its static priority and a new process gets to run."

Notice the "without blocking" part?

POSIX says:

"The sched_yield() function forces the running thread to relinquish the
processor until it again becomes the head of its thread list. It takes no
arguments."

CFS is perfectly complying with both of these. This bug report is a great
example of how sched_yield can be misunderstood and misused.

You can even argue that the sched_yield process should get even more CPU,
since it's voluntarily relinquishing (which should be rewarded) rather than
infinitely spinning (which should be punished). (Not that I agree with this
argument, I'm just using it to counter-balance the other argument.)

DS

-

To: <davids@...>
Cc: Linux-Kernel@Vger. Kernel. Org <linux-kernel@...>
Date: Wednesday, September 19, 2007 - 3:48 pm

The yielding task has given up the cpu. The other task should get to
run for a timeslice (or whatever the equivalent is in CFS) until the
yielding task again "becomes head of the thread list".

Chris
-

To: <CFRIESEN@...>
Cc: Linux-Kernel@Vger. Kernel. Org <linux-kernel@...>
Date: Wednesday, September 19, 2007 - 6:56 pm

Are you sure this isn't happening? I'll run some tests on my SMP system
running CFS. But I'll bet the context switch rate is quite rapid.

Honestly, I can't imagine what else could be happening here. Does CFS spin
in a loop doing nothing when you call sched_yield even though another task
is ready-to-run? That seems kind of bizarre. Is sched_yield acting as a
no-op?

DS

-

To: <CFRIESEN@...>
Cc: Linux-Kernel@Vger. Kernel. Org <linux-kernel@...>
Date: Wednesday, September 19, 2007 - 7:05 pm

Yep, that's exactly what's happening. The tasks are alternating. They are
both always ready-to-run. The yielding task is put at the end of the queue
for its priority level.

There is no reason the yielding task should get less CPU since they're both
always ready-to-run.

The only downside here is that a yielding task results in very small
timeslices which causes cache inefficiencies. A sane lower bound on the
timeslice might be a good idea. But there is no semantic problem.

DS

-

To: <CFRIESEN@...>
Cc: Linux-Kernel@Vger. Kernel. Org <linux-kernel@...>
Date: Wednesday, September 19, 2007 - 7:52 pm

Ack, sorry, I'm wrong.

Please ignore me, if you weren't already.

I'm glad to hear this will be fixed. The task should be moved last for its
priority level.

DS

-

To: Linux Kernel Development <linux-kernel@...>
Date: Thursday, September 13, 2007 - 3:18 am

I've tried reasonalby diligently to figure out what the hell you're doing and gone through quite a bit of your documentation, and I just can't figure it out. This could entirely be the result of your test's sensitivity to execution order.

For example, if you run ten threads that all insert, query, and delete from the *same* table, then the exact interleaving pattern will determine the size of the results. A slight change in the scheduling quantum could multiply the size of the result data by a huge factor. There is a big difference between:

1) Thread A inserts data.
2) Thread A queries data.
3) Thread A deletes data.
4) Thread B inserts data.
...

and
1) Thread A inserts data.
2) Thread B insers data.
...
101) Thread A queries data.
102) Thread B queries data.
...

There are a number of ways you might be measuring nothing but how the scheduler chooses to interleave your threads. Benchmarking threads that yield suggests just this type of thing -- if a thread has useful work to do and another thread is not going to help it, *why* *yield*?

Are you worried the scheduler isn't going to schedule other threads?! Or is there some sane reason to force suboptimal scheduling when you're trying to benchmark a scheduler? Are you trying to see how it deals with pathological patterns? ;)

The only documentation I can see about what you're actually *doing* says things like "The schema and statements are almost identical to the non-threaded tests." Do you see why that's not helpful?

DS

-

To: <davids@...>, Antoine Martin <antoine@...>
Cc: Linux Kernel Development <linux-kernel@...>
Date: Wednesday, September 12, 2007 - 7:33 pm

(cc's readded please reply to all when replying to lkml)

Hi David,

You might be sounding a bit too abrasive here... I understand you're
also trying to help, but your tone just might be taken the wrong way.

Antonie is really doing the right thing here to test such a new feature
early and on the code he cares about as a user. And most importantly,
reporting it here. This is probably the most useful resource we have in
Linux.

Maybe the workload is quirky, but regardless, if it is a *regression*
from a previous kernel then it is really important to be brought to our
-

To: Nick Piggin <nickpiggin@...>
Cc: <davids@...>, Linux Kernel Development <linux-kernel@...>
Date: Thursday, September 13, 2007 - 3:02 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

I've tested a couple more kernels: 2.6.23-rc4-mm1 and 2.6.23-rc6 with
the "sched_yield_bug_workaround" patch from Ingo, results are here:
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/K...
Same, with the load average (dotted lines):
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/K...
You are correct, it is very sensitive to the execution order and
caching. When I vary the thread pause, the total execution time varies
widely. 10ms just happens to be the sweet spot which provides the best
As I said above, I have tried varying the pause and this is the sweet
spot. If you don't yield, the first hundred threads will be finished by
the time you start the 800th - which is not what you want.
I could try just yielding at the start, or only during the first
I know it sounds sub-optimal, but this benchmark wasn't designed to test
the scheduler! It is meant to thrash just one database table. It isn't
meant to be anything like TPC either.
Previously it did uncover some very noticeable differences in JVM
I sure do! I'll try to improve on that when I get a chance. Here is a
start: the schema is configurable with simple text files. And the
database layer translates it into the SQL syntax (and types) supported
by the backend (MySQL in this case):
# cat ManyThreadedCombinedTest.properties
table=update
columns=i1:integer,i2:integer,s1:varchar,s2:varchar
I'll make a click&run tarball if anyone is interested in running the
tests themselves (it outputs csv data).

Thanks for the feedback.

Antoine
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG6Yk8GK2zHPGK1rsRCtVaAKCAVyU4woztnl6q0NAZBNsM94I2JgCcD4M3
+MpR1gAom65Mn/tQX8ig1cI=
=Q3IY
-----END PGP SIGNATURE-----
-

To: Linux-Kernel@Vger. Kernel. Org <linux-kernel@...>
Date: Thursday, September 13, 2007 - 5:47 pm

First, let me apologize if the tone of my other post came through as angry or frustrated. Text can sometimes be misleading as to tone. I certainly wasn't angry.

Second, let me say that I'm definitely not suggesting that you were wrong to bring this to everyone's attention. Even if it turns out that your code is horribly broken and it's all your fault, any apparent regression still has to be investigated. It is virtually certain that we will learn something interesting about either your code, CFS, or both.

I was definitely *not* saying "how dare you challenge CFS' supremacy without a perfect test case".

The problem is that because your access pattern is pathological, schedulers that are objectively worse may turn in much better performance. For example, a scheduler that completely ignores your attempt to sleep will probably perform significantly better than a scheduler that goes out of its way to honor it.

That the execution time is very sensitive to the pause is strong evidence of this.

The problem is simply that your test program doesn't do a fixed amount of work. It does a variable amount of work, and that amount depends upon scheduler details. It's like a job that has fifty threads that do this:

1) Increment a shared variable.
2) Do a math problem a number of times equal to the value of that shared variable.
3) Decrement the shared variable.

The problem is that the number of times the math problem is done depends upon the execution order of the threads. To be fair, you would need to benchmark how many times the math problem gets done, not how long the threads take to complete.

Now, imagine if I insert a yield between steps 1 and 2. The more a scheduler honors your yield request, the worse it will appear to perform. The scheduler that ignores it (which is legal, but definitely not The Right Thing) will seem to perform *much* better.

Of course, it's also possible that this is not what's going on.

DS

-

Previous thread: Re: [linux-dvb] [PATCH] Userspace tuner by Markus Rechberger on Wednesday, September 12, 2007 - 7:10 pm. (56 messages)

Next thread: [RFC PATCH] Add a 'minimal tree install' target by Chris Wedgwood on Wednesday, September 12, 2007 - 7:25 pm. (12 messages)