From: Max Krasnyanskiy <maxk@qualcomm.com> Addressed Ingo's comments and merged on top of latest Linus's tree. This is based on Linus' idea of creating cpu_active_map that prevents scheduler load balancer from migrating tasks to the cpu that is going down. It allows us to simplify domain management code and avoid unecessary domain rebuilds during cpu hotplug event handling. Please ignore the cpusets part for now. It needs some more work in order to avoid crazy lock nesting. Although I did simplfy and unify domain reinitialization logic. We now simply call partition_sched_domains() in all the cases. This means that we're using exact same code paths as in cpusets case and hence the test below cover cpusets too. Cpuset changes to make rebuild_sched_domains() callable from various contexts are in the separate patch (right next after this one). This not only boots but also easily handles while true; do make clean; make -j 8; done and while true; do on-off-cpu 1; done at the same time. (on-off-cpu 1 simple does echo 0/1 > /sys/.../cpu1/online thing). Suprisingly the box (dual-core Core2) is quite usable. In fact I'm typing this on right now in gnome-terminal and things are moving just fine. Also this is running with most of the debug features enabled (lockdep, mutex, etc) no BUG_ONs or lockdep complaints so far. I beleive I addressed all of the Dmitry's comments for original Linus' version. I changed both fair and rt balancer to mask out non-active cpus. And replaced cpu_is_offline() with !cpu_active() in the main scheduler code where it made sense (to me). Peter, please take a look at how I merged it with your latest RT runtime fixes. I basically moved calls to disable_runtime() into the separate notifier callback since we do not need update_sched_domains() anymore if cpusets are enabled. Greg, please take a look at the RT balancer merge. I decided not to muck with the cpupri thing and instead mask inactive cpus after the search. I've probably missed ...
why not just fixing all spelling mistakes instead of keeping them around ;) Regards Marcel --
What spelling mistakes? --
so it wasn't meant to say "... if cpu ..." and I simply miss something here (which is possible)? Regards Marcel --
It's a math term, iff means: if and only if. --
Heh. You don't have enough of a math background, it seems. 'iff' is a stronger 'if' - it's 'if and only if'. Linus --
<1216122229-4865-1-git-send-email-maxk@qualcomm.com>, Max Krasnyansky Ill take a look at this later today. Regards, -Greg --
Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Thanks for doing the cleanups and the details. Linus --
a few remarks: (1) in __migrate_task(), a test for !cpu_active(dest_cpu) should be done after double_rq_lock() [ or add the second one ] migration_thread() calls __migrate_task() with disabled interrupts (no rq-locks held), i.e. if we merely rely on rq-locks for synchronization, this can race with cpu_down(dest_cpu). [ assume, the test was done in __migration_task() and it's about to take double_lock()... and at this time, down_cpu(dest_cpu) starts and completes on another CPU ] note, we may still take the rq-lock for a "dead" cpu in this case and then only do a check (remark: in fact, not with stop_machine() in place _but_ I consider that we don't make any assumptions on its behavior); (2) it's worth to take a look at the use of any_online_cpu(): many places are ok, because there will be an additional check against cpu_active_mask later on, but e.g. set_cpus_allowed_ptr() -> migrate_task(p, any_online_cpu(mask), ...) -> migrate_task(p, dest_cpu) doesn't seem to have any verifications wrt cpu_active_map. (3) I assume, we have some kind of explicit sched_sync() after cpu_clear(cpu, cpu_active_mask) because: (a) not all places where task-migration may take place do take the rq-lock for dest_cpu : e.g. try_to_wake_up() or even sched_migrate_task() [ yes, there is a special (one might call "subtle") assumption/restriction in this case ] that's why the fact that cpu_down() takes the rq-lock for soon-to-be-offline cpu at some point can not be a "natural" sync. point to guarantee that "stale" cpu_active_map is not used. (b) in fact, stop_machine() acts as a (very strong) sync. point, sched-wise. But perhaps, we don't want to have this new easy-to-follow approach to be built on top of assumptions on how something from -- Best regards, Dmitry Adamushko --
Hmm, as you suggested I added synchronize_sched() after clearing the active bit (see below). Is that not nought enough ? I mean you mentioned that stop_machine syncs things up, I assume synchronize_sched does too. I guess testing inside the double_rq_lock() does not hurt anyway. We already have fail recovery path there. But are you sure it's needed given the explicit sync (in fact we have double sync now :), one with synchronize_sched() and How about we just introduce any_active_cpu() and replace all the usages of Yep. As you suggested I've added synchronize_sched() and updated the comment that explains the deal with the stop machine. http://lkml.org/lkml/2008/7/15/736 Peter, already ACKed it. Max --
Yes, sorry for the noise here. * synchronize_sched - block until all CPUs have exited any non-preemptive * kernel code sequences so "any non-preemptive" sections, not just the ones with run-queue I think, at least for places related to task placement (like migrate_task(..., any_online_cpu()) it would make sense, -- Best regards, Dmitry Adamushko --
>>> On Tue, Jul 15, 2008 at 7:43 AM, in message <1216122229-4865-1-git-send-email-maxk@qualcomm.com>, Max Krasnyansky Hi Max, Its still early in the morning, and I havent had my coffee yet, so what I am about to say may be totally bogus ;) ..but, I am not sure we need to do this mask here. If the hotcpu_notifiier is still running (and it appears that it is) the runqueue that is taken offline will be removed from cpupri as well. Or perhaps I am misunderstanding the intention of "active" verses "online". If I understand correctly, active and online mean more or less the same thing, but splitting it out like this allows us to skip rebuilding the domains on every hotplug. Is that correct? Assuming that the above is true, and assuming that the hotcpu_notifier is still invoked when the online status changes, cpupri will be properly updated to exclude the offline core. That will save an extra cpus_and (which the big-iron guys will be happy about ;) Regards, -Greg --
Basically with the cpu_active_map we're saying that sched domain masks may contain cpus that are going down, and the scheduler is supposed to ignore those (by checking cpu_active_map). ie The idea was to simplify cpu hotplug handling. My impression was that cpupri updates are similar to the sched Ah, now I see what you mean by the hotplug handler is still running. You're talking about set_rq_online()/set_rq_offline() calls from migration_call(). Yes did not know what they were for and did not touch that path. btw I'm definitely with you on the cpus_and() thing. When I added it in both balancers I thought that it quite an overhead on bigger boxes. So I'm not sure what the best way to handle this. If we say we're relying on hotplug event sequence to ensure that rt balancer state is consistent then we kind of back to square one. ie Might as do the same for the sched domains. I guess we could update cpupri state when we update cpu_active_map. That way the two will be synced up and we do not have to "and" them in the fast path. Any other thoughts ? Max --
>>> On Wed, Jul 16, 2008 at 5:44 PM, in message <487E6BD7.3020006@qualcomm.com>, Well, admittedly I am not entirely clear on what problem is being solved as I was not part of the original thread with Linus. My impression of what you were trying to solve was to eliminate the need to rebuild the domains for a hotplug event (which I think is a good problem to solve), thus eliminating some complexity and (iiuc) races there. However, based on what you just said, I am not sure I've got that entirely right anymore. Can you clarify the intent (or point me at the original thread) Well, not quite. cpupri ends up hanging off of a root-domain, so its more closely related to the global cpu_XXX_maps than it is to the domains. As you know, to clear a bit from the domains means walking each RQ and each domain within that runqueue since the domain structures are per-cpu. This is why historically the domains have been rebuilt when a cpu is plugged in (iiuc). However, with cpupri (and root-domains in general) the structure is shared among all member RQs. Therefore, taking a cpu online/offline is simply a matter of updating a single cpumap_t. This happens today (in sched-devel, Im not following you here (but again, it might be because of my misunderstanding of what it is you are actually solving). I see the cleanup that you did to not require rebuilding domains on hotplug as a really good thing. However, I don't see the problem with updating cpupri in the manner I mentioned either. The rq_offline routine is invoked on the DYING event (pre-offline), and the rq_online on the ONLINE event. This means that it will effectively disable the runqueue at the earliest possible moment in the hotplug processing, thereby removing the possibility that we will route to that core. My gut tells me that if this is not adequate, there there is something wrong with the hotplug notifier infrastructure and we should fix that, not put some more infrastructure on top of it. :) Don't get me wrong. I ...
Here is the link to the original thread http://lkml.org/lkml/2008/7/11/328 And here is where Linus explained the idea http://lkml.org/lkml/2008/7/12/137 I'll reply to the rest of your email tomorrow (can't keep my yes open any longer :)). Max --
>>> On Thu, Jul 17, 2008 at 3:16 AM, in message <487EF1E9.2040101@qualcomm.com>, Hi Max, Thanks for the pointers. I see that I did indeed misunderstand the intent of the patch. It seems you already solved the rebuild problem, and were just trying to solve the "migrate to a dead cpu" problem that Linus mentions as a solution with cpu_active_map. In that case, note that rq->rd->online already fits the bill, I believe. In a nutshell, rq->rd->span contains all the cpus within your disjoint cpuset, and rq->rd->online, contains the subset of rq->rd->span that are online. The online bit is cleared at the earliest point in cpu hotplug removal (DYING), and it is set at the very latest point on insertion (ONLINE). Therefore it is redundant with the cpus_active_map concept. I think the simplest solution is to make sure that we cpus_and against rq->rd->online before allowing a migration. This is how I intended the mask to be used, anyway. Its what the RT scheduler does. It sounds like we just need to touch up the few places in the CFS side that were causing those oops. Thoughts? -Greg --
Yes. btw they are definitely related, because the reason we were blowing away the domains is to avoid "migration to a dead cpu". ie We were relying on the None at this point :). I need to run right now and will try to look at this later today. My knowledge of the internal sched structs is definitely lacking so I need to look at the rq->rd thing to have and opinion. Thanx Max --
>>> On Thu, Jul 17, 2008 at 2:52 PM, in message <487F9509.9050802@qualcomm.com>, Sounds good, Max. Thanks! -Greg --
I'm thinking doing it explicitly with the new cpu mask is clearer and easier to understand than 'hiding' the variable in the root domain and having to understand all the domain/root-domain stuff. I think this was Linus' main point. It should be easier to understand this code. So, if there is functional overlap with the root domain stuff, it might be good to remove that bit and use the cpu_active_map stuff for that instead. --
While I can appreciate this sentiment, note that we conceptually require IMO the notion that the root-domain masks present. E.g. we really dont want to migrate to just cpus_active_map, but rather rq->rd->span & cpus_active_map (otherwise we could route outside of a disjoint cpuset). And this is precisely what rq->rd->online is (a cached version of cpus_active_map & rq->rd->span). So while I can understand the motivation to keep things simple, note that I tried to design the root-domain stuff to be as simple as possible while still meeting the requirements for working within disjoint sets. I am open to other suggestions, but I see nothing particularly complex or wrong with whats going on there currently. Perhaps this very I think we would be doing the code that does use it a disservice, per above. Regards, -Greg --
Sorry for the delay. I finally had a chance to read through this thread again and to look at the rq->rd->online logic. One thing that came up during original discussion with Linus and Dmitry is that cpuset can call partition_sched_domains() at random (user writes into /dev/cpuset/...) but the scheduler used to rely on a certain sequence of the domain creation/deletion during cpu hotplug. That's exactly what caused the problem in the first place. My patch that fixed domain destruction by the hotplug events changed the sequence the scheduler relied on and things broke. Greg, correct me if I'm wrong but we seem to have exact same issue with the rq->rq->online map. Lets take "cpu going down" for example. We're clearing rq->rd->online bit on DYING event, but nothing AFAICS prevents another cpu calling rebuild_sched_domains()->partition_sched_domains() in the middle of the hotplug sequence. partition_sched_domains() will happily reset rd->rq->online mask and things will fail. I'm talking about this path __build_sched_domains() -> cpu_attach_domain() -> rq_attach_root() ... cpu_set(rq->cpu, rd->span); if (cpu_isset(rq->cpu, cpu_online_map)) set_rq_online(rq); ... -- btw Why didn't we convert sched*.c to use rq->rd->online when it was introduced ? ie Instead of using cpu_online_map directly. Max --
I think you are right, but wouldn't s/online/active above fix that as=20
well? The active_map didnt exist at the time that code went in initially=
I think things were converted where they made sense to convert. But we=20
also had a different goal at that time, so perhaps something was=20
missed. If you think something else should be converted, please point=20
it out.
In the meantime, I would suggest we consider this patch on top of yours=20
(applies to tip/sched/devel):
----------------------
sched: Fully integrate cpus_active_map and root-domain code
=20
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
diff --git a/kernel/sched.c b/kernel/sched.c
index 62b1b8e..99ba70d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6611,7 +6611,7 @@ static void rq_attach_root(struct rq *rq, struct=20
root_domain *rd)
rq->rd =3D rd;
=20
cpu_set(rq->cpu, rd->span);
- if (cpu_isset(rq->cpu, cpu_online_map))
+ if (cpu_isset(rq->cpu, cpu_active_map))
set_rq_online(rq);
=20
spin_unlock_irqrestore(&rq->lock, flags);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 7f70026..2bae8de 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1004,7 +1004,7 @@ static void yield_task_fair(struct rq *rq)
* search starts with cpus closest then further out as needed,
* so we always favor a closer, idle cpu.
* Domains may include CPUs that are not usable for migration,
- * hence we need to mask them out (cpu_active_map)
+ * hence we need to mask them out (rq->rd->online)
*
* Returns the CPU we should wake onto.
*/
@@ -1032,7 +1032,7 @@ static int wake_idle(int cpu, struct task_struct *p=
)
|| ((sd->flags & SD_WAKE_IDLE_FAR)
&& !task_hot(p, task_rq(p)->clock, sd))) {
cpus_and(tmp, sd->span, p->cpus_allowed);
- cpus_and(tmp, tmp, cpu_active_map);
+ cpus_and(tmp, tmp, task_rq(p)->rd->online);
for_each_cpu_mask(i, tmp) {
if ...Right, makes sense. --
Thanks Peter. BTW: I should mention that this was just an RFC. I haven't even build=20 tested it yet. I will do this now. Regards,
Heh, might make sense to ensure it indeed builds before stuffing it into -tip ;-) --
Ok, it passes the sniff test without any corrections. :) Change the status from RFC to consider-for-inclusion. -Greg
Actually after a bit more thinking :) I realized that the scenario I explained above cannot happen because partition_sched_domains() must be called under get_online_cpus() and the set_rq_online() happens in the hotplug writer's path (ie under cpu_hotplug.lock). Since I unified all the other domain rebuild paths (arch_reinit_sched_domains, etc) we should be safe. But it again means we'd rely on those intricate dependencies that we wanted to avoid with the cpu_active_map. Also cpusets might still need to rebuild the domains in the hotplug writer's path. Ack. The only thing I'm a bit unsure of is the error scenarios in the cpu hotplug event sequence. online_map is not cleared when something in the notifier chain fails, but active_map is. Max --
Hi Ingo, Here is another patch submitted that has not been acked/nacked yet. =20 If you get a free moment, please let me know your thoughts. Here is the = full thread for your convenience: http://lkml.org/lkml/2008/7/22/281 (and FYI it was ACKed by Peter here: http://lkml.org/lkml/2008/7/22/286) -Greg
I thought this went in already. It looks good to me too. --
applied to tip/sched/devel - thanks Max. Ingo --
