On Tue, Sep 25, 2007 at 01:33:06PM +0200, Ingo Molnar wrote:
You seem to have hit the nerve for this problem. The two patches I sent:
http://lkml.org/lkml/2007/9/25/117
http://lkml.org/lkml/2007/9/25/168
partly help, but we can do better.
This one feels wrong, although I can't express my reaction correctly ..
Note that the parent entity of a task is per-cpu. So if a task A
belonging to userid guest hops from CPU0 to CPU1, it also gets a new parent
entity, which is different from its parent entity on CPU0.
Before:
taskA->se.parent = guest's tg->se[0]
After:
taskA->se.parent = guest's tg->se[1]
So walking up the entity hierarchy and fixing up (parent)se->vruntime will do
little good after the task has moved to a new cpu.
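To make the per-cpu point concrete, here is a minimal user-space sketch. It is not the kernel code: the structures are cut-down stand-ins, the group entities are embedded in an array rather than allocated per cpu, and NR_CPUS and migrate_task() are names I made up for illustration.

```c
#include <assert.h>

#define NR_CPUS 2	/* assumption: two CPUs, as in the example above */

/* Cut-down stand-in for the scheduler's per-entity state. */
struct sched_entity {
	unsigned long long vruntime;
	struct sched_entity *parent;	/* group entity on the current cpu */
};

/* Cut-down stand-in for a task group: one group entity per cpu. */
struct task_group {
	struct sched_entity se[NR_CPUS];
};

struct task {
	struct sched_entity se;
	struct task_group *tg;
};

/*
 * Hypothetical helper: when the task moves to new_cpu, its parent
 * becomes the group's entity for that cpu. Nothing is carried over
 * from the entity it had on the old cpu.
 */
static void migrate_task(struct task *p, int new_cpu)
{
	assert(new_cpu >= 0 && new_cpu < NR_CPUS);
	p->se.parent = &p->tg->se[new_cpu];
}
```

After migrate_task(&a, 1), any vruntime fixup that was applied to guest's se[0] is no longer reachable through a.se.parent, which is why walking up the hierarchy on the old cpu does not help.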
IMO, we need to be doing this:

	- For dequeue of higher level sched entities, simulate as if
	  they are going to "sleep".
	- For enqueue of higher level entities, simulate as if they are
	  "waking up". This will cause enqueue_entity() to reset their
	  vruntime (to the existing value of cfs_rq->min_vruntime) when
	  they "wake up".
If we don't do this, then let's say a group has only one task (A) and it
moves from CPU0 to CPU1. On CPU1, when the group level entity for task
A is enqueued, it will have a very low vruntime (since it was never
running), and this will give task A unlimited cpu time until its group
entity catches up with all the "sleep" time.
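A rough sketch of the difference, again in user space and with made-up names (rq_sketch, group_se_sketch, enqueue_group_se() and vruntime_lag() are all hypothetical). It only models the "reset to min_vruntime on wakeup" idea described above; the real placement logic may apply further adjustments.

```c
#include <assert.h>

/* Stand-in for a per-cpu cfs_rq: only the field the argument needs. */
struct rq_sketch {
	unsigned long long min_vruntime;
};

/* Stand-in for a group level sched entity. */
struct group_se_sketch {
	unsigned long long vruntime;
};

/*
 * Hypothetical enqueue. With wakeup set, the entity is treated as if it
 * had been sleeping and its vruntime is reset to the queue's current
 * min_vruntime. Without it, the stale vruntime is kept as-is.
 */
static void enqueue_group_se(struct rq_sketch *rq,
			     struct group_se_sketch *se, int wakeup)
{
	if (wakeup)
		se->vruntime = rq->min_vruntime;
}

/* How far behind the queue the entity is, i.e. its unearned head start. */
static unsigned long long vruntime_lag(const struct rq_sketch *rq,
				       const struct group_se_sketch *se)
{
	if (se->vruntime >= rq->min_vruntime)
		return 0;
	return rq->min_vruntime - se->vruntime;
}
```

With a group entity that has never run on CPU1 (vruntime 0) and a queue whose min_vruntime is already large, enqueueing without the wakeup flag leaves the whole difference as lag, and task A keeps getting picked until that lag is used up. Enqueueing with the flag set brings the lag to zero.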
Let me try a fix for this next ..
--
Regards,
vatsa