Re: [PATCH] sched: properly account IRQ and RT load in SCHED_OTHER load balancing

To: Ingo Molnar <mingo@...>
Cc: Peter Zijlstra <peterz@...>, Nick Piggin <nickpiggin@...>, vatsa <vatsa@...>, linux-kernel <linux-kernel@...>, D. Bahi <dbahi@...>
Date: Thursday, August 21, 2008 - 8:26 am

Ingo Molnar wrote:

I used to have a great demo for the prototype I was working on, but I'd
have to dig it up.  The gist of it is that the pre-patched scheduler
basically gets thrown for a complete loop in the presence of a mixed
CFS/RT environment.  This isn't a PREEMPT_RT-specific problem per se,
though PREEMPT_RT does bring the problem to the forefront since it has
so many active RT tasks by default (for the IRQs, etc.), which makes it
more evident.

Since the "load" that an RT task previously declared did not actually
express the true nature of the RQ load, CFS tasks would have a few
really nasty things happen to them while trying to run on the system
simultaneously.  One of them was that you could starve out CFS tasks
on certain cores (even though there was plenty of CPU bandwidth
available elsewhere) and the load balancer would think everything was
fine and thus fail to make adjustments.

Say you have a 4-core system.  You could, for instance, get into a
situation where the softirq-net-rx thread was consuming 80% of core 0,
yet the load balancer would still spread, say, a 40-thread CFS load
evenly across all cores (approximately 10 per core, though you would
account for the "load" that the softirq thread contributed too).  The
threads on the other cores would of course enjoy 100% bandwidth, while
the ~10 threads on core 0 would only see 1/5th of that bandwidth.
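
A quick back-of-the-envelope sketch of the numbers in this example,
assuming the 40 threads land exactly 10 per core and that whatever
bandwidth is left on a core is shared equally among its CFS threads.
The function name and figures are illustrative only; this is not
scheduler code.

```python
def per_thread_bandwidth(available, threads_per_core):
    """Fraction of one CPU that each CFS thread gets on each core,
    assuming the leftover bandwidth is shared equally."""
    return [avail / n for avail, n in zip(available, threads_per_core)]

# Core 0 loses 80% to the softirq thread; cores 1-3 are fully available.
available = [0.20, 1.00, 1.00, 1.00]
even_spread = [10, 10, 10, 10]

shares = per_thread_bandwidth(available, even_spread)
# How a core-0 thread fares relative to a thread on any other core.
ratio = shares[0] / shares[1]
print(shares, ratio)
```

Under those assumptions a thread on core 0 gets roughly a fifth of
what its peers on the other cores get, even though the thread counts
look perfectly balanced.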

What it comes down to is that the CFS load should have been evenly
distributed across the available bandwidth of 3*100% + 1*20%, not 4*100%
as it is today.  The net result is that the application performs in a
very lopsided manner, with some threads getting significantly less (or
sometimes zero!) CPU time compared to their peers.  You can make this
more obvious by nice'ing the CFS load up as high as it will go, which
will approximate 1/2 of the load of the softirq (since RT tasks
previously enjoyed a 2*MAX_SCHED_OTHER_LOAD rating).
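
For comparison, here is a rough sketch of what spreading the same 40
threads across 3*100% + 1*20% would look like if threads were placed
in proportion to each core's available bandwidth.  The helper name is
hypothetical, and a real balancer deals in whole tasks and weights
rather than fractional thread counts.

```python
def spread_by_capacity(total_threads, available):
    """Split threads across cores in proportion to available bandwidth."""
    total = sum(available)
    return [total_threads * a / total for a in available]

# 1*20% + 3*100% = 3.2 CPUs worth of bandwidth for the CFS load.
available = [0.20, 1.00, 1.00, 1.00]

ideal = spread_by_capacity(40, available)
# Bandwidth each thread would see under this placement.
per_thread = [a / n for a, n in zip(available, ideal)]
print(ideal, per_thread)
```

With this placement core 0 carries far fewer threads than the others,
and every thread ends up with about the same slice of CPU.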

I have observed this phenomenon (and its fix) while looking at things
like network-intensive workloads.  I'm sure there are plenty of others
that could cause similar ripples.

The fact is, the scheduler treats "load" to mean certain things which
simply did not apply to RT tasks.  As you know very well, I'm sure ;),
"load" is a metric which expresses the share of the CPU that will be
consumed, and this is used by the load balancer to make its decisions.
However, you can put whatever rating you want on an RT task and it would
always be irrelevant.  RT tasks run as frequently and as long as they
want (w.r.t. SCHED_OTHER) independent of what their load rating implies
to the balancer, so you cannot make an accurate assessment of the true
"available shares".  This is why the load balancer would become confused
and fail to see true imbalance in a mixed environment.  Fixing this, as
Peter has attempted to do, will result in a much better distribution of
SCHED_OTHER tasks across the true available bandwidth, and thus improve
overall performance.
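
To illustrate why the balancer fails to see the imbalance, here is a
toy model.  It is not the actual scheduler code, nor Peter's patch.
It compares per-core load computed from task weights alone against
the same load scaled by the bandwidth left over after RT work.  The
per-task weight of 1024, the 20% figure, and all function names are
assumptions made for the sketch, and the RT task's own weight is left
out to keep it simple.

```python
NICE_0_WEIGHT = 1024  # illustrative per-task weight, assumed for this sketch

def raw_load(nr_tasks):
    """Load as a plain sum of task weights, blind to RT consumption."""
    return nr_tasks * NICE_0_WEIGHT

def scaled_load(nr_tasks, available):
    """Load per unit of bandwidth actually left for SCHED_OTHER tasks."""
    return raw_load(nr_tasks) / available

def imbalance(loads):
    """Ratio of the busiest core's load to the least busy core's load."""
    return max(loads) / min(loads)

tasks = [10, 10, 10, 10]
available = [0.20, 1.00, 1.00, 1.00]  # core 0 keeps only 20% for CFS

raw = [raw_load(n) for n in tasks]
scaled = [scaled_load(n, a) for n, a in zip(tasks, available)]
print(imbalance(raw), imbalance(scaled))
```

Judged by weights alone the four cores look identical, so nothing
gets moved; once the load is scaled by available bandwidth, core 0
stands out as several times busier than the rest.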

In previous discussions with people, I had always used a metaphor of a
stream.  A system running SCHED_OTHER tasks is like a smooth-running
stream, but dispatching an RT task (or an IRQ, even) is like throwing a
boulder into the water.  It makes a big disruptive splash and causes
turbulent white water behind it.  And the stream has no influence over
the size of the boulder, its placement in the stream, or how long it
will be staying.

This fix (at least in concept) allows it to become more like gently
slipping a streamlined aerodynamic object into the water.  The stream
still cannot do anything about the size or placement of the object, but
it can at least flow around it and smoothly adapt to the reduced volume
of water that the stream can carry. :)

HTH
-Greg

Messages in current thread:
Re: [PATCH] sched: properly account IRQ and RT load in SCHED..., Gregory Haskins, (Thu Aug 21, 8:26 am)