Re: [PATCH] sched: Give cpusets exclusive control over sched domains (ie remove cpu_isolated_map)

Previous thread: build issue #509 for v2.6.26-rc4-27-g0a2ce2f :undefined reference to `xquad_portio' by Toralf on Thursday, May 29, 2008 - 2:13 pm. (2 messages)

Next thread: [PATCH 1/6] Werror: Fix casting wrong size integer to pointer by David Howells on Thursday, May 29, 2008 - 2:17 pm. (20 messages)
To: <mingo@...>
Cc: <pj@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>, Max Krasnyansky <maxk@...>
Date: Thursday, May 29, 2008 - 2:17 pm

First issue is not related to the cpusets. We're simply leaking doms_cur.
It's allocated in arch_init_sched_domains() which is called for every
hotplug event. So we just keep reallocation doms_cur without freeing it.
I introduced free_sched_domains() function that cleans things up.

Second issue is that sched domains created by the cpusets are
completely destroyed by the CPU hotplug events. For all CPU hotplug
events scheduler attaches all CPUs to the NULL domain and then puts
them all into the single domain thereby destroying domains created
by the cpusets (partition_sched_domains).
The solution is simple, when cpusets are enabled scheduler should not
create default domain and instead let cpusets do that. Which is
exactly what the patch does.

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
---
kernel/cpuset.c | 6 ++++++
kernel/sched.c | 22 ++++++++++++++++++++++
2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index a1b61f4..29c6304 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1789,6 +1789,12 @@ static void common_cpu_mem_hotplug_unplug(void)
top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
scan_for_empty_cpusets(&top_cpuset);

+ /*
+ * Scheduler destroys domains on hotplug events.
+ * Rebuild them based on the current settings.
+ */
+ rebuild_sched_domains();
+
cgroup_unlock();
}

diff --git a/kernel/sched.c b/kernel/sched.c
index 8dcdec6..fc8f46b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6855,6 +6855,18 @@ void __attribute__((weak)) arch_update_cpu_topology(void)
}

/*
+ * Free current domain masks.
+ * Called after all cpus are attached to NULL domain.
+ */
+static void free_sched_domains(void)
+{
+ ndoms_cur = 0;
+ if (doms_cur != &fallback_doms)
+ kfree(doms_cur);
+ doms_cur = &fallback_doms;
+}
+
+/*
* Set up scheduler domains and groups. Callers must hold the hotplug lock.
* For now this just excludes isolated cpus, but coul...

To: Max Krasnyansky <maxk@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>
Date: Monday, June 2, 2008 - 11:19 am

I think this patch is ok --- though (1) perhaps it should have been two patches,
as it seems to fix two bugs, and (2) I only managed to get my head about 80% of
the way around this, so while I confident enough to Ack this one, I'm not
as certain as I might be.

Acked-by: Paul Jackson <pj@sgi.com>

I did see a couple of places where the code could be more clearly presented,
and a couple of lines ending in space characters in *.c files (the Doc files
are hopelessly failing this space terminator test, so I gave up on them.)

The above patch, on top of the code already there, ends up with the
following code for update_sched_domains():

============================== begin ==============================
static int update_sched_domains(struct notifier_block *nfb,
unsigned long action, void *hcpu)
{
switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
case CPU_DOWN_PREPARE:
case CPU_DOWN_PREPARE_FROZEN:
detach_destroy_domains(&cpu_online_map);
free_sched_domains();
return NOTIFY_OK;

case CPU_UP_CANCELED:
case CPU_UP_CANCELED_FROZEN:
case CPU_DOWN_FAILED:
case CPU_DOWN_FAILED_FROZEN:
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
case CPU_DEAD:
case CPU_DEAD_FROZEN:
/*
* Fall through and re-initialise the domains.
*/
break;
default:
return NOTIFY_DONE;
}

#ifndef CONFIG_CPUSETS
/*
* Create default domain partitioning if cpusets are disabled.
* Otherwise we let cpusets rebuild the domains based on the
* current setup.
*/

/* The hotplug lock is already held by cpu_up/cpu_down */
arch_init_sched_domains(&cpu_online_map);
#endif

return NOTIFY_OK;
}
=============================== end ========...

To: Paul Jackson <pj@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>
Date: Monday, June 2, 2008 - 3:01 pm

Paul, I just realized that this won't work. We actually need the default
version of arch_init_sched_domains() which builds the default domain. It's
used during bootup. It's only the hotplug event handling where we do not want
the default.
I could still redo the switch() statement as you suggested but it seems that
it's not going to improve the readability that much at this point.

Max
--

To: Paul Jackson <pj@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>
Date: Monday, June 2, 2008 - 1:55 pm

I'll fix the spaces and redo the scheduler parts based on your suggestions.
As usual, nice and thorough review. Thanx.

Max
--

To: <mingo@...>
Cc: <pj@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>, Max Krasnyansky <maxk@...>
Date: Thursday, May 29, 2008 - 2:17 pm

kernel/cpu.c seems a more logical place for those maps since they do not really
have much to do with the scheduler these days.

kernel/cpu.c is now built for the UP kernel too, but it does not affect the size
the kernel sections.

$ size vmlinux

before
text data bss dec hex filename
3313797 307060 310352 3931209 3bfc49 vmlinux

after
text data bss dec hex filename
3313797 307060 310352 3931209 3bfc49 vmlinux

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
---
kernel/Makefile | 4 ++--
kernel/cpu.c | 24 ++++++++++++++++++++++++
kernel/sched.c | 18 ------------------
3 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/kernel/Makefile b/kernel/Makefile
index 6c584c5..bb8da0a 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -3,7 +3,7 @@
#

obj-y = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
- exit.o itimer.o time.o softirq.o resource.o \
+ cpu.o exit.o itimer.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
signal.o sys.o kmod.o workqueue.o pid.o \
rcupdate.o extable.o params.o posix-timers.o \
@@ -27,7 +27,7 @@ obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
-obj-$(CONFIG_SMP) += cpu.o spinlock.o
+obj-$(CONFIG_SMP) += spinlock.o
obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock.o
obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
obj-$(CONFIG_UID16) += uid16.o
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 2eff3f6..6aa48cf 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -15,6 +15,28 @@
#include <linux/stop_machine.h>
#include <linux/mutex.h>

+/*
+ * Represents all cpu's present in the system
+ * In systems capable of hotplug, this map could dynamically grow
+ * as new cpu's are detected in the system via any pla...

To: Max Krasnyansky <maxk@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>, <maxk@...>
Date: Thursday, May 29, 2008 - 7:26 pm

I don't think this is possible. It must affect the size.

The patch moves kernel/cpu.o from being included only if CONFIG_SMP,
to being included always:

obj-y = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
- exit.o itimer.o time.o softirq.o resource.o \
+ cpu.o exit.o itimer.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
signal.o sys.o kmod.o workqueue.o pid.o \
rcupdate.o extable.o params.o posix-timers.o \
@@ -27,7 +27,7 @@ obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
-obj-$(CONFIG_SMP) += cpu.o spinlock.o
+obj-$(CONFIG_SMP) += spinlock.o

The kernel/cpu.o object file is non-empty; it has size,
perhaps 1 or 2 kbytes of text, and a little bit of data.

So adding it to builds which disabled SMP must add to the
size of the resultant kernel, by my thinking anyway.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--

To: Paul Jackson <pj@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>
Date: Friday, May 30, 2008 - 12:08 am

All the code is ifdefed under CONFIG_SMP. So there is no code in the !SMP
case, just three masks. When you mentioned it first time I went back and
rebuilt the !SMP kernel with and without the patch. Luckily I have a
ridiculously fast 8way box here so I went ahead and verified it again.

Before:
size vmlinux
text data bss dec hex filename
3305622 306420 310352 3922393 3bd9d9 vmlinux

size kernel/cpu.o
size: 'kernel/cpu.o': No such file

After:
size vmlinux
text data bss dec hex filename
3305621 306420 310352 3922393 3bd9d9 vmlinux

size kernel/cpu.o
text data bss dec hex filename
96 24 0 120 78 kernel/cpu.o

Max
--

To: Max Krasnyansky <maxk@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>
Date: Friday, May 30, 2008 - 12:10 am

Ah - that's what I missed. You're right.
I withdraw this objection.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--

To: <mingo@...>
Cc: <pj@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>, Max Krasnyansky <maxk@...>
Date: Thursday, May 29, 2008 - 2:17 pm

Ingo and Peter mentioned several times that cpu_isolated_map was a horrible hack.
So lets get rid of it.

cpu_isolated_map is controlling which CPUs are subject to the scheduler load balancing.
CPUs set in that map are put in the NULL scheduler domain and are excluded from the
load balancing. This functionality is provided in much more flexible and dynamic way by
the cpusets subsystem. Scheduler load balancing can be disabled/enabled either system
wide or per cpuset.

This patch gives cpusets exclusive control over the scheduler domains.

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
---
Documentation/cpusets.txt | 3 +--
kernel/sched.c | 34 +++++-----------------------------
2 files changed, 6 insertions(+), 31 deletions(-)

diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index ad2bb3b..d8b269a 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -382,8 +382,7 @@ Put simply, it costs less to balance between two smaller sched domains
than one big one, but doing so means that overloads in one of the
two domains won't be load balanced to the other one.

-By default, there is one sched domain covering all CPUs, except those
-marked isolated using the kernel boot time "isolcpus=" argument.
+By default, there is one sched domain covering all CPUs.

This default load balancing across all CPUs is not well suited for
the following two situations:
diff --git a/kernel/sched.c b/kernel/sched.c
index 5ebf6a7..e2eb2be 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6206,24 +6206,6 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
}

-/* cpus with isolated domains */
-static cpumask_t cpu_isolated_map = CPU_MASK_NONE;
-
-/* Setup the mask of cpus configured for isolated domains */
-static int __init isolated_cpu_setup(char *str)
-{
- int ints[NR_CPUS], i;
-
- str = get_options(str, ARRAY_SIZE(ints), ints);
- cpus_clear(cpu_isolated_ma...

To: Max Krasnyansky <maxk@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>, <maxk@...>
Date: Tuesday, June 3, 2008 - 8:49 pm

I'll NAQ this one.

As a result of the separate lkml thread:

Inquiry: Should we remove "isolcpus= kernel boot option? (may have realtime uses)

and as a result of there not yet being a good reason -to- remove this,
other than a few kernel developers who don't like it, I conclude that
we should not just remove the "isolcpus=" feature.

It might be that we should deprecate it, as I have suggested in the
past. But I'm not now seeing even a good reason for that.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--

To: Max Krasnyansky <maxk@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>, <maxk@...>
Date: Thursday, May 29, 2008 - 6:37 pm

Well .. if you can just resent patches for no apparent reason, with
no apparent change, I guess I can just resent my reply, similarly.

Yeah ... a hack ... but I suspect some folks use it.

I'm reluctant to discard features visible to users, unless
they are getting in the way of more serious stuff.

I'd have figured that this hack was not all that much of
a pain to the kernel code.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--

To: Paul Jackson <pj@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>
Date: Thursday, May 29, 2008 - 6:59 pm

Sorry I just updated and resent all my outstanding sched fixes.

btw I suppose you missed my reply. Where I provided the reasoning. Also Ingo
at some point accepted a patch from Peter that partially removed the isolated
map. This is just more complete version.

Anyway, I do not really care that much for this particular patch. But I do
care about cpu hotplug / domain handling issues. Can you please look at that
instead of resending your replies ;-).
I'm talking about
[PATCH] sched: CPU hotplug events must not destroy scheduler domains created

by the cpusets

Also please ack (take 3) version of the default irq affinity patch that
includes your latest comments.

Thanx

--

To: Max Krasnyanskiy <maxk@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>
Date: Thursday, May 29, 2008 - 8:12 pm

Reference? Did it remove the boot parameter isolcpus= ?

My concern is removing user visible features, without
good reason and careful consideration and some planning.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--

To: Paul Jackson <pj@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>
Date: Friday, May 30, 2008 - 12:24 am

Yes it removed boot param. I cannot seem to come up with the right search
I understand. How would you plan for this ?
----

Did you get a chance to look at patch that addresses hotplug/domain issues ?

Max
--

To: Max Krasnyansky <maxk@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>
Date: Friday, May 30, 2008 - 2:52 am

Ah - ok - I got that reply, and then lost track of it. My bad.

This doesn't make sense to me. We don't just decree that we aren't
planning on supporting something that's already out there and being
used, and then remove it, on the grounds we aren't supporting it.

I don't plan for removing it yet, because I haven't seen a good

Not yet.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--

To: Paul Jackson <pj@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>
Date: Saturday, May 31, 2008 - 3:12 pm

Ok. Let me ask you this. Would you be ok with a patch that exposes (via sysctl
for example) scheduler balancer mask when cpusets are disabled ?
In other words it will look something like this:
- Rename cpu_isolated_map to sched_balancer_map
- If cpusets are enabled
o balancer map is compiled out or a noop
o isolcpus= boot param is compiled out

- If cpusets are disabled
o balancer map can be changed via /proc/sys/kernel/sched_balancer_mask
writing to it rebuilds scheduler domains
cpus not in the mask will be put into NULL domain
o isolcpus= boot param is available for compatibility

Why do this ?
Two reasons. It would not longer be a hack, it simply exposes scheduler
feature that is not otherwise available without cpusets. And there is no
conflict with sched domain management when cpusets are enabled. ie cpuset have
exclusive control on domains).

Max

--

To: Max Krasnyansky <maxk@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>
Date: Sunday, June 1, 2008 - 10:35 pm

I wasn't looking for a variant implementation. Unless this variant
serves some critical purpose that isolcpus can't provide, I would be
against it. I'm trying to minimize API changes to kernel users.

I am trying to discuss the reasons for or against removing isolcpus.

I just started a separate lkml thread, with a wider audience, to
address this question:

Inquiry: Should we remove "isolcpus= kernel boot option? (may have realtime uses)

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--

To: Max Krasnyansky <maxk@...>
Cc: Paul Jackson <pj@...>, <mingo@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>
Date: Sunday, June 1, 2008 - 5:21 am

Uhm, might be me but those two answers are not an answer to the question
posed.

Anyway, no, yuck! - let just get rid of it.

--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <maxk@...>, <mingo@...>, <linux-kernel@...>, <menage@...>, <rostedt@...>
Date: Sunday, June 1, 2008 - 10:39 pm

I object to "just" (as in suddenly, without deprecation or prior
warning) removing user space visible kernel features, unless we
have some good reason why we can't take a more methodical approach,
deprecating first.

But I've said that before.

I just started a separate lkml thread, with a wider audience, to
address this question:

Inquiry: Should we remove "isolcpus= kernel boot option? (may have realtime uses)

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--

Previous thread: build issue #509 for v2.6.26-rc4-27-g0a2ce2f :undefined reference to `xquad_portio' by Toralf on Thursday, May 29, 2008 - 2:13 pm. (2 messages)

Next thread: [PATCH 1/6] Werror: Fix casting wrong size integer to pointer by David Howells on Thursday, May 29, 2008 - 2:17 pm. (20 messages)