login
Header Space

 
 

Finding Bugs With CFS

September 26, 2007 - 8:25pm
Submitted by Jeremy on September 26, 2007 - 8:25pm.
Linux news

A potential bug reported against the Completely Fair Scheduler suggested that it was causing a network slowdown, measured with the 'Iperf' bandwidth performance benchmarking tool. The performance hit was quickly tracked to the previously discussed changes in how CFS handles sched_yield(). When it was suggested that this was a bug in the new process scheduler, Ingo explained:

"I had a quick look at the source code, and the reason for that weird yield usage was that there's a locking bug in iperf's 'Reporter thread' abstraction and apparently instead of fixing the bug it was worked around via a horrible yield() based user-space lock."

He then submit a small patch to fix the bug and remove the call to sched_yield() resulting in, "iperf uses _much_ less CPU time. On my Core2Duo test system, before the patch it used up 100% CPU time to saturate 1 gigabit of network traffic to another box. With the patch applied it now uses 9% of CPU time." He added playfully, "sched_yield() is almost always the symptom of broken locking or other bug. In that sense CFS does the right thing by exposing such bugs =B-)" Stephen Hemminger pointed out that a similar patch had been submitted to the Iperf project last month as it caused an identical problem with FreeBSD's scheduler.


From: Martin Michlmayr <tbm@...>
Subject: Network slowdown due to CFS
Date: Sep 26, 4:52 am 2007

I noticed that my network performance has gone down from 2.6.22
from   [  3]  0.0-10.0 sec    113 MBytes  95.0 Mbits/sec
to     [  3]  0.0-10.0 sec   75.7 MBytes  63.3 Mbits/sec
with 2.6.23-rc1 (and 2.6.23-rc8), as measured with iperf.

I did a git bisect today and tracked it back to the commit where CFS
was enabled ("sched: cfs core code; apply the CFS core code",
commit dd41f596cda0d7d6e4a8b139ffdfabcefdd46528).  I also compiled a
kernel from
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
but things don't improve.

This is on a Thecus N2100, an ARM (Intel IOP32x) based storage device
with a r8169 card, SATA disks and 512 MB RAM.  My config is attached.

What kind of information can I supply so you can track this down?
-- 
Martin Michlmayr
http://www.cyrius.com/

From: Mike Galbraith <efault@...> Subject: Re: Network slowdown due to CFS Date: Sep 26, 6:23 am 2007 On Wed, 2007-09-26 at 10:52 +0200, Martin Michlmayr wrote: > I noticed that my network performance has gone down from 2.6.22 > >from [ 3] 0.0-10.0 sec 113 MBytes 95.0 Mbits/sec > to [ 3] 0.0-10.0 sec 75.7 MBytes 63.3 Mbits/sec > with 2.6.23-rc1 (and 2.6.23-rc8), as measured with iperf. > > I did a git bisect today and tracked it back to the commit where CFS > was enabled ("sched: cfs core code; apply the CFS core code", > commit dd41f596cda0d7d6e4a8b139ffdfabcefdd46528). I also compiled a > kernel from > git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git > but things don't improve. > > This is on a Thecus N2100, an ARM (Intel IOP32x) based storage device > with a r8169 card, SATA disks and 512 MB RAM. My config is attached. > > What kind of information can I supply so you can track this down? I noticed on the iperf website a patch which contains sched_yield(). http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt Do you have that patch applied by any chance? If so, it might be a worth while to try it without it. -Mike -
From: Martin Michlmayr <tbm@...> Subject: Re: Network slowdown due to CFS Date: Sep 26, 6:48 am 2007 * Mike Galbraith <efault@gmx.de> [2007-09-26 12:23]: > I noticed on the iperf website a patch which contains sched_yield(). > http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt > > Do you have that patch applied by any chance? If so, it might be a > worth while to try it without it. Yes, this patch was applied. When I revert it, I get the same (high) performance with both kernels. -- Martin Michlmayr http://www.cyrius.com/ -
From: Ingo Molnar <mingo@...> Subject: Re: Network slowdown due to CFS Date: Sep 26, 7:21 am 2007 * Martin Michlmayr <tbm@cyrius.com> wrote: > * Mike Galbraith <efault@gmx.de> [2007-09-26 12:23]: > > I noticed on the iperf website a patch which contains sched_yield(). > > http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt > > > > Do you have that patch applied by any chance? If so, it might be a > > worth while to try it without it. > > Yes, this patch was applied. When I revert it, I get the same (high) > performance with both kernels. great! Could you try this too: echo 1 > /proc/sys/kernel/sched_compat_yield does it fix iperf performance too (with the yield patch applied to iperf)? I think the real fix would be for iperf to use blocking network IO though, or maybe to use a POSIX mutex or POSIX semaphores. Ingo -
From: Martin Michlmayr <tbm@...> Subject: Re: Network slowdown due to CFS Date: Sep 26, 7:29 am 2007 * Ingo Molnar <mingo@elte.hu> [2007-09-26 13:21]: > > > I noticed on the iperf website a patch which contains sched_yield(). > > > http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt > > great! Could you try this too: > echo 1 > /proc/sys/kernel/sched_compat_yield > > does it fix iperf performance too (with the yield patch applied to > iperf)? Yes, this gives me good performance too. > I think the real fix would be for iperf to use blocking network IO > though, or maybe to use a POSIX mutex or POSIX semaphores. So it's definitely not a bug in the kernel, only in iperf? (CCing Stephen Hemminger who wrote the iperf patch.) -- Martin Michlmayr http://www.cyrius.com/ -
From: David Schwartz <davids@...> Subject: RE: Network slowdown due to CFS Date: Sep 26, 8:00 am 2007 > > I think the real fix would be for iperf to use blocking network IO > > though, or maybe to use a POSIX mutex or POSIX semaphores. > > So it's definitely not a bug in the kernel, only in iperf? Martin: Actually, in this case I think iperf is doing the right thing (though not the best thing) and the kernel is doing the wrong thing. It's calling 'sched_yield' to ensure that every other thread gets a chance to run before the current thread runs again. CFS is not doing that, allowing the yielding thread to hog the CPU to the exclusion of the other threads. (It can allow the yielding thread to hog the CPU, of course, just not to the exclusion of other threads.) It's still better to use some kind of rational synchronization primitive (like mutex/sempahore) so that the other threads can tell you when there's something for you to do. It's still better to use blocking network IO, so the kernel will let you know exactly when to try I/O and your dynamic priority can rise. Ingo: Can you clarify what CFS' current default sched_yield implementation is and what setting sched_compat_yield to 1 does? Which way do we get the right semantics (all threads of equal static priority are scheduled, with some possible SMP fuzziness, before this thread is scheduled again)? DS -
From: Ingo Molnar <mingo@...> Subject: Re: Network slowdown due to CFS Date: Sep 26, 9:31 am 2007 * David Schwartz <davids@webmaster.com> wrote: > > > I think the real fix would be for iperf to use blocking network IO > > > though, or maybe to use a POSIX mutex or POSIX semaphores. > > > > So it's definitely not a bug in the kernel, only in iperf? > > Martin: > > Actually, in this case I think iperf is doing the right thing (though not > the best thing) and the kernel is doing the wrong thing. [...] it's not doing the right thing at all. I had a quick look at the source code, and the reason for that weird yield usage was that there's a locking bug in iperf's "Reporter thread" abstraction and apparently instead of fixing the bug it was worked around via a horrible yield() based user-space lock. the (small) patch below fixes the iperf locking bug and removes the yield() use. There are numerous immediate benefits of this patch: - iperf uses _much_ less CPU time. On my Core2Duo test system, before the patch it used up 100% CPU time to saturate 1 gigabit of network traffic to another box. With the patch applied it now uses 9% of CPU time. - sys_sched_yield() is removed altogether - i was able to measure much higher bandwidth over localhost for example. This is the case for over-the-network measurements as well. - the results are also more consistent and more deterministic, hence more reliable as a benchmarking tool. (the reason for that is that more CPU time is spent on actually delivering packets, instead of mindlessly polling on the user-space "lock", so we actually max out the CPU, instead of relying on the random proportion the workload was able to make progress versus wasting CPU time on polling.) sched_yield() is almost always the symptom of broken locking or other bug. In that sense CFS does the right thing by exposing such bugs =B-) Ingo -------------------------> Subject: iperf: fix locking From: Ingo Molnar <mingo@elte.hu> fix iperf locking - it was burning CPU time while polling unnecessarily, instead of using the proper wait primitives. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- compat/Thread.c | 3 --- src/Reporter.c | 13 +++++++++---- src/main.cpp | 2 ++ 3 files changed, 11 insertions(+), 7 deletions(-) Index: iperf-2.0.2/compat/Thread.c =================================================================== --- iperf-2.0.2.orig/compat/Thread.c +++ iperf-2.0.2/compat/Thread.c @@ -405,9 +405,6 @@ int thread_numuserthreads( void ) { void thread_rest ( void ) { #if defined( HAVE_THREAD ) #if defined( HAVE_POSIX_THREAD ) - // TODO add checks for sched_yield or pthread_yield and call that - // if available - usleep( 0 ); #else // Win32 SwitchToThread( ); #endif Index: iperf-2.0.2/src/Reporter.c =================================================================== --- iperf-2.0.2.orig/src/Reporter.c +++ iperf-2.0.2/src/Reporter.c @@ -111,6 +111,7 @@ report_statistics multiple_reports[kRepo char buffer[64]; // Buffer for printing ReportHeader *ReportRoot = NULL; extern Condition ReportCond; +extern Condition ReportDoneCond; int reporter_process_report ( ReportHeader *report ); void process_report ( ReportHeader *report ); int reporter_handle_packet( ReportHeader *report ); @@ -338,7 +339,7 @@ void ReportPacket( ReportHeader* agent, // item while ( index == 0 ) { Condition_Signal( &ReportCond ); - thread_rest(); + Condition_Wait( &ReportDoneCond ); index = agent->reporterindex; } agent->agentindex = 0; @@ -346,7 +347,7 @@ void ReportPacket( ReportHeader* agent, // Need to make sure that reporter is not about to be "lapped" while ( index - 1 == agent->agentindex ) { Condition_Signal( &ReportCond ); - thread_rest(); + Condition_Wait( &ReportDoneCond ); index = agent->reporterindex; } @@ -553,6 +554,7 @@ void reporter_spawn( thread_Settings *th } Condition_Unlock ( ReportCond ); +again: if ( ReportRoot != NULL ) { ReportHeader *temp = ReportRoot; //Condition_Unlock ( ReportCond ); @@ -575,9 +577,12 @@ void reporter_spawn( thread_Settings *th // finished with report so free it free( temp ); Condition_Unlock ( ReportCond ); + Condition_Signal( &ReportDoneCond ); + if (ReportRoot) + goto again; } - // yield control of CPU is another thread is waiting - thread_rest(); + Condition_Signal( &ReportDoneCond ); + usleep(10000); } else { //Condition_Unlock ( ReportCond ); } Index: iperf-2.0.2/src/main.cpp =================================================================== --- iperf-2.0.2.orig/src/main.cpp +++ iperf-2.0.2/src/main.cpp @@ -96,6 +96,7 @@ extern "C" { // records being accessed in a report and also to // serialize modification of the report list Condition ReportCond; + Condition ReportDoneCond; } // global variables only accessed within this file @@ -141,6 +142,7 @@ int main( int argc, char **argv ) { // Initialize global mutexes and conditions Condition_Initialize ( &ReportCond ); + Condition_Initialize ( &ReportDoneCond ); Mutex_Initialize( &groupCond ); Mutex_Initialize( &clients_mutex ); -
From: Stephen Hemminger <shemminger@...> Subject: Re: Network slowdown due to CFS Date: Sep 26, 11:46 am 2007 Here is the combined fixes from iperf-users list. Begin forwarded message: Date: Thu, 30 Aug 2007 15:55:22 -0400 From: "Andrew Gallatin" <gallatin@gmail.com> To: iperf-users@dast.nlanr.net Subject: [PATCH] performance fixes for non-linux Hi, I've attached a patch which gives iperf similar performance to netperf on my FreeBSD, MacOSX and Solaris hosts. It does not seem to negatively impact Linux. I only started looking at the iperf source yesterday, so I don't really expect this to be integrated as is, but a patch is worth a 1000 words :) Background: On both Solaris and FreeBSD, there are 2 things slowing iperf down: The gettimeofday timestamp around each socket read/write is terribly expensive, and the sched_yield() or usleep(0) causes iperf to take 100% of the time (system time on BSD, split user/system time on Solaris and MacOSX), which slows things down and confuses the scheduler. To address the gettimeofday() issue, I treat TCP different than UDP, and TCP tests behave as though only a single (huge) packet was sent. Rather then ending the test based on polling gettimeofday() timestamps, an interval timer / sigalarm handler is used. I had to increase the packetLen from an int to a max_size_t. To address the sched_yield/usleep issue, I put the reporter thread to sleep on a condition variable. For the TCP tests at least, there is no reason to have it running during the test and it is best to just get it out of the way rather than burning CPU in a tight loop. I've also incorporated some fixes from the FreeBSD ports collection: --- include/headers.h use a 64-bit type for max_size_t --- compat/Thread.c oldTID is not declared anywhere. Make this compile (seems needed for at least FreeBSD & MacOSX) --- src/Client.cpp BSDs can return ENOBUFS during a UDP test when the socket buffer fills. Don't exit when this happens. I've run the resulting iperf on FreeBSD, Solaris, MacOSX and Linux, and it seems to work for me. It is nice not to have a 100% CPU load when running an iperf test across a 100Mb/s network. Drew


Next step

September 27, 2007 - 10:01am
Anonymous (not verified)

Now you just need to figure out which OS/kernel iperf's developpers use, write a "fair scheduler" for it, and the bug should get quickly fixed. ;-)

> sched_yield() is almost

September 28, 2007 - 5:21am

> sched_yield() is almost always the symptom of broken locking or other bug.

Then maybe it should be patched to print a big warning each time it is called. At least to syslog.

;-)

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
speck-geostationary