Gregory Haskins wrote:

Sorry for the delay. I finally had a chance to read through this thread again and to look at the rq->rd->online logic.

One thing that came up during the original discussion with Linus and Dmitry is that cpusets can call partition_sched_domains() at arbitrary times (whenever a user writes into /dev/cpuset/...), but the scheduler used to rely on a certain sequence of domain creation/deletion during cpu hotplug. That's exactly what caused the problem in the first place. My patch that fixed domain destruction by the hotplug events changed the sequence the scheduler relied on, and things broke.

Greg, correct me if I'm wrong, but we seem to have the exact same issue with the rq->rd->online map. Let's take "cpu going down" for example. We're clearing the rq->rd->online bit on the DYING event, but nothing AFAICS prevents another cpu from calling rebuild_sched_domains() -> partition_sched_domains() in the middle of the hotplug sequence. partition_sched_domains() will happily reset the rq->rd->online mask and things will fail. I'm talking about this path:

    __build_sched_domains() -> cpu_attach_domain() -> rq_attach_root()
        ...
        cpu_set(rq->cpu, rd->span);
        if (cpu_isset(rq->cpu, cpu_online_map))
            set_rq_online(rq);
        ...

btw, why didn't we convert sched*.c to use rq->rd->online when it was introduced, i.e. instead of using cpu_online_map directly?

Max
