Re: [PATCH 2/2] cpuhotplug: make get_online_cpus() scalability by using percpu counter

Previous thread: [PATCH 1/2] cpuhotplug: do not need cpu_hotplug_begin() when CONFIG_HOTPLUG_CPU=n by Lai Jiangshan on Monday, April 5, 2010 - 3:37 am. (1 message)

Next thread: [announce] atang tree has been rebased on top of 2.6.33 by Bartlomiej Zolnierkiewicz on Monday, April 5, 2010 - 4:12 am. (1 message)
From: Lai Jiangshan
Date: Monday, April 5, 2010 - 3:38 am

Current get_online_cpus() acquires a mutex lock and then
release it. It is not scale and it hurts cache. This patch rewrite it.

1) get_online_cpus() must be allowed to be called recursively, so I added
   get_online_cpus_nest for every task for new code.

   This patch just allows get_online_cpus() to be called recursively,
   but when it is not nested, get_online_cpus() will wait until
   cpuhotplug finished, so the potential starvation is avoided.

   And, the livelock of cpu_hotplug_begin() is avoided, so the comment
   is removed.

2) This new code use PER_CPU counters, and this counters protected
   by RCU. These counters acts like the reference counters of a modules.
   (Actually, all these code is stolen from module.c: try_refcount_get()
   is stolen from try_module_get(), put_online_cpus() from module_put()...)

   After this patch applied, get_online_cpus() is very light and scale when
   cpuhotplug is not running. It just disables preemption and increase
   the cpu counter and then enables preemption.

3) Since we have try_refcount_get(), I add a new API try_get_online_cpus().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 include/linux/cpu.h   |    2
 include/linux/sched.h |    3 +
 kernel/cpu.c          |  131 ++++++++++++++++++++++++++++++++------------------
 kernel/fork.c         |    3 +
 4 files changed, 94 insertions(+), 45 deletions(-)
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index e287863..a32809c 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -112,6 +112,7 @@ extern struct sysdev_class cpu_sysdev_class;
 
 extern void get_online_cpus(void);
 extern void put_online_cpus(void);
+extern int try_get_online_cpus(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
@@ -134,6 +135,7 @@ static inline void cpu_hotplug_driver_unlock(void)
 
 #define get_online_cpus()	do { ...
From: Oleg Nesterov
Date: Monday, April 5, 2010 - 9:29 am

Well, iirc one of the goals of

	cpu-hotplug: replace lock_cpu_hotplug() with get_online_cpus()
	86ef5c9a8edd78e6bf92879f32329d89b2d55b5a

was avoiding the new members in task_struct. I leave this up to you
and Gautham.


Lai, I didn't read this patch carefully yet (and I can't apply it to

This looks unsafe. In theory nothing protects cpu_hotplug_task from
exiting if refcount_sum() becomes zero, this means wake_up_process()
can hit the freed/reused/unmapped task_struct. Probably cpu_hotplug_done()
needs another synhronize_sched() before return.

OTOH, I do not understand why the result of __get_cpu_var(refcount)
must be visible to refcount_sum() if we race with cpu_hotplug_begin(),
so it seems to me cpu_hotplug_begin() also needs synchronize_sched()
before refcount_sum().

Oleg.

--

From: Oleg Nesterov
Date: Tuesday, April 6, 2010 - 5:00 am

No, I misread the unapplied patch, sorry for noise.

Oleg.

--

From: Lai Jiangshan
Date: Wednesday, April 7, 2010 - 6:35 am

Old get_online_cpus() is read-preference, I think the goal of this ability
is allow get_online_cpus()/put_online_cpus() to be called nested.

But read-preference RWL may cause write side starvation, so I abandon this ability,
and use per-task counter for allowing get_online_cpus()/put_online_cpus()

preempt_disable() prevent cpu_hotplug_task from exiting.


Thanks, Lai
--

From: Oleg Nesterov
Date: Wednesday, April 7, 2010 - 6:54 am

If the cpu_down() is the caller of cpu_hotplug_begin/done, then yes.

But unless I missed something, nothing protects from cpu_up() which
takes this lock too.

Just in case... I am not saying this is really possible in practice.

Oleg.

--

From: Oleg Nesterov
Date: Friday, April 9, 2010 - 5:12 am

But, I must admit, I'd like to avoid adding the new member to task_struct.

What do you think about the code below?

I didn't even try to compile it, just to explain what I mean.

In short: we have the per-cpu fast counters, plus the slow counter
which is only used when cpu_hotplug_begin() is in progress.

Oleg.


static DEFINE_PER_CPU(long, cpuhp_fast_ctr);
static struct task_struct *cpuhp_writer;
static DEFINE_MUTEX(cpuhp_slow_lock)
static long cpuhp_slow_ctr;

static bool update_fast_ctr(int inc)
{
	bool success = true;

	preempt_disable();
	if (likely(!cpuhp_writer))
		__get_cpu_var(cpuhp_fast_ctr) += inc;
	else if (cpuhp_writer != current)
		success = false;
	preempt_enable();

	return success;
}

void get_online_cpus(void)
{
	if (likely(update_fast_ctr(+1));
		return;

	mutex_lock(&cpuhp_slow_lock);
	cpuhp_slow_ctr++;
	mutex_unlock(&cpuhp_slow_lock);
}

void put_online_cpus(void)
{
	if (likely(update_fast_ctr(-1));
		return;

	mutex_lock(&cpuhp_slow_lock);
	if (!--cpuhp_slow_ctr && cpuhp_writer)
		wake_up_process(cpuhp_writer);
	mutex_unlock(&cpuhp_slow_lock);
}

static void clear_fast_ctr(void)
{
	long total = 0;
	int cpu;

	for_each_possible_cpu(cpu) {
		total += per_cpu(cpuhp_fast_ctr, cpu);
		per_cpu(cpuhp_fast_ctr, cpu) = 0;
	}

	return total;
}

static void cpu_hotplug_begin(void)
{
	cpuhp_writer = current;
	synchronize_sched();

	/* Nobody except us can use can use cpuhp_fast_ctr */

	mutex_lock(&cpuhp_slow_lock);
	cpuhp_slow_ctr += clear_fast_ctr();

	while (cpuhp_slow_ctr) {
		__set_current_state(TASK_UNINTERRUPTIBLE);
		mutex_unlock(&&cpuhp_slow_lock);
		schedule();
		mutex_lock(&cpuhp_slow_lock);
	}
}

static void cpu_hotplug_done(void)
{
	cpuhp_writer = NULL;
	mutex_unlock(&cpuhp_slow_lock);
}

--

From: Lai Jiangshan
Date: Monday, April 12, 2010 - 2:24 am

get_online_cpus() in your code is still read-preference.
I wish we quit this ability of get_online_cpus().

Lai.
--

From: Peter Zijlstra
Date: Monday, April 12, 2010 - 2:28 am

Why?
--

From: Lai Jiangshan
Date: Monday, April 12, 2010 - 5:30 am

Because read-preference RWL will cause write site starvation.

A user run the following code will cause cpuhotplug starvation.
(100 processes run sched_setaffinity().)

Lai

#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>

#define NCPU 4
#define NPROCESS 100

cpu_set_t set;
pid_t target;

void stress_test(void)
{
	int cpu;

	srand((int)target);
	for (;;) {
		cpu = rand() % NCPU;
		CPU_SET(cpu, &set);
		sched_setaffinity(target, sizeof(set), &set);
		CPU_CLR(cpu, &set);
	}
}

int main(int argc, char *argv[])
{
	pid_t ret;
	int i;

	target = getpid();
	for (i = 1; i < NPROCESS; i++) {
		ret = fork();
		if (ret < 0)
			break;
		else if (ret)
			target = ret;
		else
			stress_test();
	}

	stress_test();
}



--

From: Peter Zijlstra
Date: Monday, April 12, 2010 - 5:34 am

Sure, but why is that a problem? Hotplug should be a rare event, who
cares if it takes a while to come through.


--

From: Lai Jiangshan
Date: Monday, April 12, 2010 - 6:47 pm

It is totally starvation, hotplug can not complete.
I think it is a kind of DOS attack.
--

From: Oleg Nesterov
Date: Monday, April 12, 2010 - 11:16 am

Heh. Since I never read the changelogs, I didn't even notice this was
one of the goals of your patch. I thought this is just the side-effect.

Yes, if we want to block the new readers, I don't see how it is possible
to avoid the counter in task_struct.

I can't judge whether this new member worth the trouble. Once again, I am
not arguing, just I don't know. And I think your patch is correct (apart
from pure theoretical race with cpu_hotplug_done afaics).

Oleg.

--

Previous thread: [PATCH 1/2] cpuhotplug: do not need cpu_hotplug_begin() when CONFIG_HOTPLUG_CPU=n by Lai Jiangshan on Monday, April 5, 2010 - 3:37 am. (1 message)

Next thread: [announce] atang tree has been rebased on top of 2.6.33 by Bartlomiej Zolnierkiewicz on Monday, April 5, 2010 - 4:12 am. (1 message)