Ingo Molnar wrote:

I used to have a great demo for the prototype I was working on, but I'd have to dig it up. The gist of it is that the pre-patched scheduler basically gets thrown for a complete loop in the presence of a mixed CFS/RT environment. This isn't a PREEMPT_RT specific problem per se, though PREEMPT_RT does bring the problem to the forefront, since it has so many active RT tasks by default (for the IRQs, etc.), which makes it more evident.

Since an RT task's previous way of declaring "load" did not actually express the true nature of the RQ load, CFS tasks would have a few really nasty things happen to them while trying to run on the system simultaneously. One of them was that you could starve out CFS tasks from certain cores (even though there was plenty of CPU bandwidth available elsewhere), and the load-balancer would think everything was fine and thus fail to make adjustments.

Say you have a 4 core system. You could, for instance, get into a situation where the softirq-net-rx thread was consuming 80% of core 0, yet the load balancer would still spread, say, a 40 thread CFS load evenly across all cores (approximately 10 per core, though you would account for the "load" that the softirq thread contributed too). The threads on the other cores would of course enjoy 100% bandwidth, while the ~10 threads on core 0 would only see 1/5th of that bandwidth.

What it comes down to is that the CFS load should have been evenly distributed across the available bandwidth of 3*100% + 1*20%, not 4*100% as it is today. The net result is that the application performs in a very lopsided manner, with some threads getting significantly less (or sometimes zero!) cpu time compared to their peers. You can make this more obvious by nice'ing the CFS load up as high as it will go, which will approximate 1/2 of the load of the softirq (since RT tasks previously enjoyed a 2*MAX_SCHED_OTHER_LOAD rating).

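To put rough numbers on the four-core example, here is a small Python sketch. It is only an illustration, not kernel code: the function names and the idea of expressing each core's CFS-available bandwidth as a percentage are assumptions made for the example, and it ignores the softirq thread's own "load" contribution.

```python
# Illustrative arithmetic only; these helpers are hypothetical and do not
# correspond to anything in the kernel's load balancer.

def per_thread_share(cfs_bandwidth_pct, threads_per_core):
    """Percent of a CPU each CFS thread gets on every core."""
    return [
        bw / n if n else 0.0
        for bw, n in zip(cfs_bandwidth_pct, threads_per_core)
    ]


def bandwidth_weighted_split(cfs_bandwidth_pct, total_threads):
    """Thread count per core if threads follow the bandwidth left for CFS."""
    total_bw = sum(cfs_bandwidth_pct)
    return [total_threads * bw / total_bw for bw in cfs_bandwidth_pct]


# Core 0 loses 80% to the softirq-net-rx thread; cores 1-3 are fully free.
bandwidth = [20, 100, 100, 100]

# What the balancer did: roughly 10 CFS threads on every core.
even_spread = per_thread_share(bandwidth, [10, 10, 10, 10])

# What it should aim for: spread across 3*100% + 1*20% = 320%.
weighted = bandwidth_weighted_split(bandwidth, 40)

print(even_spread)
print(weighted)
```

With the even spread, the threads on core 0 get 2% of a CPU each versus 10% elsewhere. A bandwidth-weighted split would put about 2.5 threads on core 0 and 12.5 on each of the other cores, so every thread would see roughly 8%.
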
I have observed this phenomenon (and its fix) while looking at things like network intensive workloads. I'm sure there are plenty of others that could cause similar ripples.

The fact is, the scheduler treats "load" to mean certain things which simply did not apply to RT tasks. As you know very well, I'm sure ;), "load" is a metric which expresses the share of the cpu that will be consumed, and this is used by the load balancer to make its decisions. However, you can put whatever rating you want on an RT task and it would always be irrelevant. RT tasks run as frequently and as long as they want (w.r.t. SCHED_OTHER), independent of what their load rating implies to the balancer, so you cannot make an accurate assessment of the true "available shares". This is why the load-balancer would become confused and fail to see true imbalance in a mixed environment. Fixing this, as Peter has attempted to do, will result in a much better distribution of SCHED_OTHER tasks across the true available bandwidth, and thus improve overall performance.

In previous discussions with people, I had always used the metaphor of a stream. A system running SCHED_OTHER tasks is like a smooth running stream, but dispatching an RT task (or an IRQ, even) is like throwing a boulder into the water. It makes a big disruptive splash and causes turbulent white water behind it. And the stream has no influence over the size of the boulder, its placement in the stream, or how long it will be staying.

This fix (at least in concept) allows it to become more like gently slipping a streamlined object into the water. The stream still cannot do anything about the size or placement of the object, but it can at least flow around it and smoothly adapt to the reduced volume of water that the stream can carry. :)

HTH
-Greg
