A potential bug reported against the Completely Fair Scheduler [1] suggested that it was causing a network slowdown, measured with the 'Iperf' bandwidth performance benchmarking tool. The performance hit was quickly tracked to the previously discussed [2] changes in how CFS handles sched_yield() [3]. When it was suggested that this was a bug in the new process scheduler, Ingo explained:
"I had a quick look at the source code, and the reason for that weird yield usage was that there's a locking bug in iperf's 'Reporter thread' abstraction and apparently instead of fixing the bug it was worked around via a horrible yield() based user-space lock."
He then submit a small patch to fix the bug and remove the call to sched_yield() resulting in, "iperf uses _much_ less CPU time. On my Core2Duo test system, before the patch it used up 100% CPU time to saturate 1 gigabit of network traffic to another box. With the patch applied it now uses 9% of CPU time." He added playfully, "sched_yield() is almost always the symptom of broken locking or other bug. In that sense CFS does the right thing by exposing such bugs =B-)" Stephen Hemminger pointed out that a similar patch had been submitted to the Iperf project last month as it caused an identical problem with FreeBSD's scheduler.
From: Martin Michlmayr <tbm@...>
Subject: Network slowdown due to CFS
[3]Date: Sep 26, 4:52 am 2007
I noticed that my network performance has gone down from 2.6.22
from [ 3] 0.0-10.0 sec 113 MBytes 95.0 Mbits/sec
to [ 3] 0.0-10.0 sec 75.7 MBytes 63.3 Mbits/sec
with 2.6.23-rc1 (and 2.6.23-rc8), as measured with iperf.
I did a git bisect today and tracked it back to the commit where CFS
was enabled ("sched: cfs core code; apply the CFS core code",
commit dd41f596cda0d7d6e4a8b139ffdfabcefdd46528). I also compiled a
kernel from
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
but things don't improve.
This is on a Thecus N2100, an ARM (Intel IOP32x) based storage device
with a r8169 card, SATA disks and 512 MB RAM. My config is attached.
What kind of information can I supply so you can track this down?
--
Martin Michlmayr
http://www.cyrius.com/ [4]
From: Mike Galbraith <efault@...>
Subject: Re: Network slowdown due to CFS
[4]Date: Sep 26, 6:23 am 2007
On Wed, 2007-09-26 at 10:52 +0200, Martin Michlmayr wrote:
> I noticed that my network performance has gone down from 2.6.22
> >from [ 3] 0.0-10.0 sec 113 MBytes 95.0 Mbits/sec
> to [ 3] 0.0-10.0 sec 75.7 MBytes 63.3 Mbits/sec
> with 2.6.23-rc1 (and 2.6.23-rc8), as measured with iperf.
>
> I did a git bisect today and tracked it back to the commit where CFS
> was enabled ("sched: cfs core code; apply the CFS core code",
> commit dd41f596cda0d7d6e4a8b139ffdfabcefdd46528). I also compiled a
> kernel from
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
> but things don't improve.
>
> This is on a Thecus N2100, an ARM (Intel IOP32x) based storage device
> with a r8169 card, SATA disks and 512 MB RAM. My config is attached.
>
> What kind of information can I supply so you can track this down?
I noticed on the iperf website a patch which contains sched_yield().
http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt [5]
Do you have that patch applied by any chance? If so, it might be a
worth while to try it without it.
-Mike
-
From: Martin Michlmayr <tbm@...>
Subject: Re: Network slowdown due to CFS
[5]Date: Sep 26, 6:48 am 2007
* Mike Galbraith <efault@gmx.de> [2007-09-26 12:23]:
> I noticed on the iperf website a patch which contains sched_yield().
> http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt [6]
>
> Do you have that patch applied by any chance? If so, it might be a
> worth while to try it without it.
Yes, this patch was applied. When I revert it, I get the same (high)
performance with both kernels.
--
Martin Michlmayr
http://www.cyrius.com/ [7]
-
From: Ingo Molnar <mingo@...>
Subject: Re: Network slowdown due to CFS
[7]Date: Sep 26, 7:21 am 2007
* Martin Michlmayr <tbm@cyrius.com> wrote:
> * Mike Galbraith <efault@gmx.de> [2007-09-26 12:23]:
> > I noticed on the iperf website a patch which contains sched_yield().
> > http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt [8]
> >
> > Do you have that patch applied by any chance? If so, it might be a
> > worth while to try it without it.
>
> Yes, this patch was applied. When I revert it, I get the same (high)
> performance with both kernels.
great! Could you try this too:
echo 1 > /proc/sys/kernel/sched_compat_yield
does it fix iperf performance too (with the yield patch applied to
iperf)?
I think the real fix would be for iperf to use blocking network IO
though, or maybe to use a POSIX mutex or POSIX semaphores.
Ingo
-
From: Martin Michlmayr <tbm@...>
Subject: Re: Network slowdown due to CFS
[8]Date: Sep 26, 7:29 am 2007
* Ingo Molnar <mingo@elte.hu> [2007-09-26 13:21]:
> > > I noticed on the iperf website a patch which contains sched_yield().
> > > http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt [9]
>
> great! Could you try this too:
> echo 1 > /proc/sys/kernel/sched_compat_yield
>
> does it fix iperf performance too (with the yield patch applied to
> iperf)?
Yes, this gives me good performance too.
> I think the real fix would be for iperf to use blocking network IO
> though, or maybe to use a POSIX mutex or POSIX semaphores.
So it's definitely not a bug in the kernel, only in iperf?
(CCing Stephen Hemminger who wrote the iperf patch.)
--
Martin Michlmayr
http://www.cyrius.com/ [10]
-
From: David Schwartz <davids@...>
Subject: RE: Network slowdown due to CFS
[10]Date: Sep 26, 8:00 am 2007
> > I think the real fix would be for iperf to use blocking network IO
> > though, or maybe to use a POSIX mutex or POSIX semaphores.
>
> So it's definitely not a bug in the kernel, only in iperf?
Martin:
Actually, in this case I think iperf is doing the right thing (though not
the best thing) and the kernel is doing the wrong thing. It's calling
'sched_yield' to ensure that every other thread gets a chance to run before
the current thread runs again. CFS is not doing that, allowing the yielding
thread to hog the CPU to the exclusion of the other threads. (It can allow
the yielding thread to hog the CPU, of course, just not to the exclusion of
other threads.)
It's still better to use some kind of rational synchronization primitive
(like mutex/sempahore) so that the other threads can tell you when there's
something for you to do. It's still better to use blocking network IO, so
the kernel will let you know exactly when to try I/O and your dynamic
priority can rise.
Ingo:
Can you clarify what CFS' current default sched_yield implementation is and
what setting sched_compat_yield to 1 does? Which way do we get the right
semantics (all threads of equal static priority are scheduled, with some
possible SMP fuzziness, before this thread is scheduled again)?
DS
-
From: Ingo Molnar <mingo@...>
Subject: Re: Network slowdown due to CFS
[10]Date: Sep 26, 9:31 am 2007
* David Schwartz <davids@webmaster.com> wrote:
> > > I think the real fix would be for iperf to use blocking network IO
> > > though, or maybe to use a POSIX mutex or POSIX semaphores.
> >
> > So it's definitely not a bug in the kernel, only in iperf?
>
> Martin:
>
> Actually, in this case I think iperf is doing the right thing (though not
> the best thing) and the kernel is doing the wrong thing. [...]
it's not doing the right thing at all. I had a quick look at the source
code, and the reason for that weird yield usage was that there's a
locking bug in iperf's "Reporter thread" abstraction and apparently
instead of fixing the bug it was worked around via a horrible yield()
based user-space lock.
the (small) patch below fixes the iperf locking bug and removes the
yield() use. There are numerous immediate benefits of this patch:
- iperf uses _much_ less CPU time. On my Core2Duo test system, before
the patch it used up 100% CPU time to saturate 1 gigabit of network
traffic to another box. With the patch applied it now uses 9% of
CPU time.
- sys_sched_yield() is removed altogether
- i was able to measure much higher bandwidth over localhost for
example. This is the case for over-the-network measurements as well.
- the results are also more consistent and more deterministic, hence
more reliable as a benchmarking tool. (the reason for that is that
more CPU time is spent on actually delivering packets, instead of
mindlessly polling on the user-space "lock", so we actually max out
the CPU, instead of relying on the random proportion the workload was
able to make progress versus wasting CPU time on polling.)
sched_yield() is almost always the symptom of broken locking or other
bug. In that sense CFS does the right thing by exposing such bugs =B-)
Ingo
------------------------->
Subject: iperf: fix locking
From: Ingo Molnar <mingo@elte.hu>
fix iperf locking - it was burning CPU time while polling
unnecessarily, instead of using the proper wait primitives.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
compat/Thread.c | 3 ---
src/Reporter.c | 13 +++++++++----
src/main.cpp | 2 ++
3 files changed, 11 insertions(+), 7 deletions(-)
Index: iperf-2.0.2/compat/Thread.c
===================================================================
--- iperf-2.0.2.orig/compat/Thread.c
+++ iperf-2.0.2/compat/Thread.c
@@ -405,9 +405,6 @@ int thread_numuserthreads( void ) {
void thread_rest ( void ) {
#if defined( HAVE_THREAD )
#if defined( HAVE_POSIX_THREAD )
- // TODO add checks for sched_yield or pthread_yield and call that
- // if available
- usleep( 0 );
#else // Win32
SwitchToThread( );
#endif
Index: iperf-2.0.2/src/Reporter.c
===================================================================
--- iperf-2.0.2.orig/src/Reporter.c
+++ iperf-2.0.2/src/Reporter.c
@@ -111,6 +111,7 @@ report_statistics multiple_reports[kRepo
char buffer[64]; // Buffer for printing
ReportHeader *ReportRoot = NULL;
extern Condition ReportCond;
+extern Condition ReportDoneCond;
int reporter_process_report ( ReportHeader *report );
void process_report ( ReportHeader *report );
int reporter_handle_packet( ReportHeader *report );
@@ -338,7 +339,7 @@ void ReportPacket( ReportHeader* agent,
// item
while ( index == 0 ) {
Condition_Signal( &ReportCond );
- thread_rest();
+ Condition_Wait( &ReportDoneCond );
index = agent->reporterindex;
}
agent->agentindex = 0;
@@ -346,7 +347,7 @@ void ReportPacket( ReportHeader* agent,
// Need to make sure that reporter is not about to be "lapped"
while ( index - 1 == agent->agentindex ) {
Condition_Signal( &ReportCond );
- thread_rest();
+ Condition_Wait( &ReportDoneCond );
index = agent->reporterindex;
}
@@ -553,6 +554,7 @@ void reporter_spawn( thread_Settings *th
}
Condition_Unlock ( ReportCond );
+again:
if ( ReportRoot != NULL ) {
ReportHeader *temp = ReportRoot;
//Condition_Unlock ( ReportCond );
@@ -575,9 +577,12 @@ void reporter_spawn( thread_Settings *th
// finished with report so free it
free( temp );
Condition_Unlock ( ReportCond );
+ Condition_Signal( &ReportDoneCond );
+ if (ReportRoot)
+ goto again;
}
- // yield control of CPU is another thread is waiting
- thread_rest();
+ Condition_Signal( &ReportDoneCond );
+ usleep(10000);
} else {
//Condition_Unlock ( ReportCond );
}
Index: iperf-2.0.2/src/main.cpp
===================================================================
--- iperf-2.0.2.orig/src/main.cpp
+++ iperf-2.0.2/src/main.cpp
@@ -96,6 +96,7 @@ extern "C" {
// records being accessed in a report and also to
// serialize modification of the report list
Condition ReportCond;
+ Condition ReportDoneCond;
}
// global variables only accessed within this file
@@ -141,6 +142,7 @@ int main( int argc, char **argv ) {
// Initialize global mutexes and conditions
Condition_Initialize ( &ReportCond );
+ Condition_Initialize ( &ReportDoneCond );
Mutex_Initialize( &groupCond );
Mutex_Initialize( &clients_mutex );
-
From: Stephen Hemminger <shemminger@...>
Subject: Re: Network slowdown due to CFS
[10]Date: Sep 26, 11:46 am 2007
Here is the combined fixes from iperf-users list.
Begin forwarded message:
Date: Thu, 30 Aug 2007 15:55:22 -0400
From: "Andrew Gallatin" <gallatin@gmail.com>
To: iperf-users@dast.nlanr.net [11]
Subject: [PATCH] performance fixes for non-linux
Hi,
I've attached a patch which gives iperf similar performance to netperf
on my FreeBSD, MacOSX and Solaris hosts. It does not seem to
negatively impact Linux. I only started looking at the iperf source
yesterday, so I don't really expect this to be integrated as is, but a
patch is worth a 1000 words :)
Background: On both Solaris and FreeBSD, there are 2 things slowing
iperf down: The gettimeofday timestamp around each socket read/write
is terribly expensive, and the sched_yield() or usleep(0) causes iperf
to take 100% of the time (system time on BSD, split user/system time
on Solaris and MacOSX), which slows things down and confuses the
scheduler.
To address the gettimeofday() issue, I treat TCP different than UDP,
and TCP tests behave as though only a single (huge) packet was sent.
Rather then ending the test based on polling gettimeofday()
timestamps, an interval timer / sigalarm handler is used. I had
to increase the packetLen from an int to a max_size_t.
To address the sched_yield/usleep issue, I put the reporter thread
to sleep on a condition variable. For the TCP tests at least, there
is no reason to have it running during the test and it is best
to just get it out of the way rather than burning CPU in a tight
loop.
I've also incorporated some fixes from the FreeBSD ports collection:
--- include/headers.h
use a 64-bit type for max_size_t
--- compat/Thread.c
oldTID is not declared anywhere. Make this compile
(seems needed for at least FreeBSD & MacOSX)
--- src/Client.cpp
BSDs can return ENOBUFS during a UDP test when the socket
buffer fills. Don't exit when this happens.
I've run the resulting iperf on FreeBSD, Solaris, MacOSX and Linux,
and it seems to work for me. It is nice not to have a 100% CPU
load when running an iperf test across a 100Mb/s network.
DrewRelated links:
- Archive of above thread [11]