Current get_online_cpus() acquires a mutex lock and then
release it. It is not scale and it hurts cache. This patch rewrite it.
1) get_online_cpus() must be allowed to be called recursively, so I added
get_online_cpus_nest for every task for new code.
This patch just allows get_online_cpus() to be called recursively,
but when it is not nested, get_online_cpus() will wait until
cpuhotplug finished, so the potential starvation is avoided.
And, the livelock of cpu_hotplug_begin() is avoided, so the comment
is removed.
2) This new code use PER_CPU counters, and this counters protected
by RCU. These counters acts like the reference counters of a modules.
(Actually, all these code is stolen from module.c: try_refcount_get()
is stolen from try_module_get(), put_online_cpus() from module_put()...)
After this patch applied, get_online_cpus() is very light and scale when
cpuhotplug is not running. It just disables preemption and increase
the cpu counter and then enables preemption.
3) Since we have try_refcount_get(), I add a new API try_get_online_cpus().
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
include/linux/cpu.h | 2
include/linux/sched.h | 3 +
kernel/cpu.c | 131 ++++++++++++++++++++++++++++++++------------------
kernel/fork.c | 3 +
4 files changed, 94 insertions(+), 45 deletions(-)
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index e287863..a32809c 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -112,6 +112,7 @@ extern struct sysdev_class cpu_sysdev_class;
extern void get_online_cpus(void);
extern void put_online_cpus(void);
+extern int try_get_online_cpus(void);
#define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
#define register_hotcpu_notifier(nb) register_cpu_notifier(nb)
#define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb)
@@ -134,6 +135,7 @@ static inline void cpu_hotplug_driver_unlock(void)
#define get_online_cpus() do { ...Well, iirc one of the goals of cpu-hotplug: replace lock_cpu_hotplug() with get_online_cpus() 86ef5c9a8edd78e6bf92879f32329d89b2d55b5a was avoiding the new members in task_struct. I leave this up to you and Gautham. Lai, I didn't read this patch carefully yet (and I can't apply it to This looks unsafe. In theory nothing protects cpu_hotplug_task from exiting if refcount_sum() becomes zero, this means wake_up_process() can hit the freed/reused/unmapped task_struct. Probably cpu_hotplug_done() needs another synhronize_sched() before return. OTOH, I do not understand why the result of __get_cpu_var(refcount) must be visible to refcount_sum() if we race with cpu_hotplug_begin(), so it seems to me cpu_hotplug_begin() also needs synchronize_sched() before refcount_sum(). Oleg. --
No, I misread the unapplied patch, sorry for noise. Oleg. --
Old get_online_cpus() is read-preference, I think the goal of this ability is allow get_online_cpus()/put_online_cpus() to be called nested. But read-preference RWL may cause write side starvation, so I abandon this ability, and use per-task counter for allowing get_online_cpus()/put_online_cpus() preempt_disable() prevent cpu_hotplug_task from exiting. Thanks, Lai --
If the cpu_down() is the caller of cpu_hotplug_begin/done, then yes. But unless I missed something, nothing protects from cpu_up() which takes this lock too. Just in case... I am not saying this is really possible in practice. Oleg. --
But, I must admit, I'd like to avoid adding the new member to task_struct.
What do you think about the code below?
I didn't even try to compile it, just to explain what I mean.
In short: we have the per-cpu fast counters, plus the slow counter
which is only used when cpu_hotplug_begin() is in progress.
Oleg.
static DEFINE_PER_CPU(long, cpuhp_fast_ctr);
static struct task_struct *cpuhp_writer;
static DEFINE_MUTEX(cpuhp_slow_lock)
static long cpuhp_slow_ctr;
static bool update_fast_ctr(int inc)
{
bool success = true;
preempt_disable();
if (likely(!cpuhp_writer))
__get_cpu_var(cpuhp_fast_ctr) += inc;
else if (cpuhp_writer != current)
success = false;
preempt_enable();
return success;
}
void get_online_cpus(void)
{
if (likely(update_fast_ctr(+1));
return;
mutex_lock(&cpuhp_slow_lock);
cpuhp_slow_ctr++;
mutex_unlock(&cpuhp_slow_lock);
}
void put_online_cpus(void)
{
if (likely(update_fast_ctr(-1));
return;
mutex_lock(&cpuhp_slow_lock);
if (!--cpuhp_slow_ctr && cpuhp_writer)
wake_up_process(cpuhp_writer);
mutex_unlock(&cpuhp_slow_lock);
}
static void clear_fast_ctr(void)
{
long total = 0;
int cpu;
for_each_possible_cpu(cpu) {
total += per_cpu(cpuhp_fast_ctr, cpu);
per_cpu(cpuhp_fast_ctr, cpu) = 0;
}
return total;
}
static void cpu_hotplug_begin(void)
{
cpuhp_writer = current;
synchronize_sched();
/* Nobody except us can use can use cpuhp_fast_ctr */
mutex_lock(&cpuhp_slow_lock);
cpuhp_slow_ctr += clear_fast_ctr();
while (cpuhp_slow_ctr) {
__set_current_state(TASK_UNINTERRUPTIBLE);
mutex_unlock(&&cpuhp_slow_lock);
schedule();
mutex_lock(&cpuhp_slow_lock);
}
}
static void cpu_hotplug_done(void)
{
cpuhp_writer = NULL;
mutex_unlock(&cpuhp_slow_lock);
}
--
get_online_cpus() in your code is still read-preference. I wish we quit this ability of get_online_cpus(). Lai. --
Because read-preference RWL will cause write site starvation.
A user run the following code will cause cpuhotplug starvation.
(100 processes run sched_setaffinity().)
Lai
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#define NCPU 4
#define NPROCESS 100
cpu_set_t set;
pid_t target;
void stress_test(void)
{
int cpu;
srand((int)target);
for (;;) {
cpu = rand() % NCPU;
CPU_SET(cpu, &set);
sched_setaffinity(target, sizeof(set), &set);
CPU_CLR(cpu, &set);
}
}
int main(int argc, char *argv[])
{
pid_t ret;
int i;
target = getpid();
for (i = 1; i < NPROCESS; i++) {
ret = fork();
if (ret < 0)
break;
else if (ret)
target = ret;
else
stress_test();
}
stress_test();
}
--
Sure, but why is that a problem? Hotplug should be a rare event, who cares if it takes a while to come through. --
It is totally starvation, hotplug can not complete. I think it is a kind of DOS attack. --
Heh. Since I never read the changelogs, I didn't even notice this was one of the goals of your patch. I thought this is just the side-effect. Yes, if we want to block the new readers, I don't see how it is possible to avoid the counter in task_struct. I can't judge whether this new member worth the trouble. Once again, I am not arguing, just I don't know. And I think your patch is correct (apart from pure theoretical race with cpu_hotplug_done afaics). Oleg. --
